'''`scriptSplit`''' (formerly known as '''`cmdSplit`'''), by [dgp], parses a [script] into its constituent commands while properly handling semicolon-delimited commands and the "semicolon in a comment" problem.  It was written to support parsing of class bodies in an itcl-like, pure Tcl, OO framework into Tcl commands.

[PYK] 2014-10-08:   '''`scriptSplit`''' was previously named '''`cmdSplit`''',
and '''`cmdSplit`''' was previously named '''`wordSplit`'''.  They were
recently renamed in order to provide a set of commands whose names are
consistent and descriptive.  Check the history of this page for some
discussion. 


** See Also **

   [cmdStream]:   

   [Config file using slave interp], by [AMG]:   more-or-less the same thing, implemented using a slave [interp]reter


** Description **


`scriptSplit` returns a list of the commands in a script.  The original post is
''[http://groups.google.com/group/comp.lang.tcl/msg/cfe2d00fc7b291be%|%How to
split a string into elements exactly as eval would do, comp.lang.tcl, 1998-09-07]''.

[PYK] 2013-04-14:  I've modified `scriptSplit` so that it no longer filters out
comments, and provided a simple helper, `nocomments`, that filters them out if
desired:

======
nocomments [scriptSplit $script]
======

code:

======
proc scriptSplit {script} {
    set commands {}
    set chunk {} 
    foreach line [split $script \n] {
        append chunk $line
        if {[info complete $chunk\n]} {
            # $chunk ends in a complete Tcl command, and none of the
            # newlines within it end a complete Tcl command.  If there
            # are multiple Tcl commands in $chunk, they must be
            # separated by semi-colons.
            set cmd {} 
            foreach part [split $chunk \;] {
                append cmd $part
                if {[info complete $cmd\n]} {
                    set cmd [string trimleft $cmd]
                    #drop empty commands
                    if {$cmd eq {}} {
                        continue
                    }
                    if {[string match \#* $cmd]} {
                        #the semi-colon was part of a comment.  Add it back
                        append cmd \;
                    } else {
                        lappend commands $cmd
                        set cmd {}
                    }
                } else {
                    # No complete command yet.
                    # Replace semicolon and continue
                    append cmd \;
                }
            }
            #if there was an "inline" comment, it will be in cmd, with an
            #additional semicolon at the end
            if {$cmd ne {}} {
                lappend commands [string replace $cmd[set cmd {}] end end]
            }
            set chunk {} 
        } else {
            # No end of command yet.  Put the newline back and continue
            append chunk \n
        }
    }
    if {![string match {} [string trimright $chunk]]} {
        return -code error "Can't parse script into a\
                sequence of commands.\n\tIncomplete\
                command:\n-----\n$chunk\n-----"
    }
    return $commands
}

proc nocomments {commands} {
    set res [list]
    foreach command $commands {
        if {![string match \#* $command]} {
            lappend res $command
        }
    }
    return $res
}
======
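
To see why `scriptSplit` leans on `[info complete]` instead of splitting on newlines and semicolons directly, here is a minimal sketch.  The sample script is made up for illustration, and the naive `[split]` is not part of the original code:

======
# Hypothetical three-line sample: a brace-quoted semicolon on line 1 and
# a semicolon inside a comment on line 2.
set script "set a 1; set b {x;y}\n# note; not a command\nputs \$a"

# Naive approach: break at every newline and every semicolon.
set naive [split $script "\n;"]
puts [llength $naive]                    ;# 6 pieces, not 4 commands

# [info complete] shows which pieces cannot stand alone.
puts [info complete " set b \{x\n"]      ;# 0: the brace is still open
puts [info complete " set b \{x;y\}\n"]  ;# 1: gluing the pieces back fixes it
======

Hand-tracing the `scriptSplit` listing above on this sample gives four elements: `set a 1`, `set b {x;y}`, `# note; not a command` and `puts $a`.  `nocomments` would then drop the third.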


** cmdSplit **

[Sarnold]: `cmdSplit` takes a command and returns its arguments as a list.

======
proc cmdSplit command {
    if {![info complete $command]} {error "non complete command"}
    set res ""; # the list of words
    set chunk ""
    foreach word [split $command " \t"] {
        # testing each word until the word being tested makes the
        # command up to it complete
        # example:
        # set "a b"
        # set -> complete, 1 word
        # set "a -> not complete
        # set "a b" -> complete, 2 words
        append chunk $word
        if {[info complete "$res $chunk\n"]} {
            lappend res $chunk
            set chunk ""
        } else {
            append chunk " "
        }
    }
    lsearch -inline -all -not $res {}   ;# empty words denote consecutive whitespace
}
======
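
The closing `[lsearch]` line may look odd at first.  A minimal sketch of what it does, using a made-up input string:

======
# Two consecutive spaces make [split] emit an empty element.
set pieces [split "a  b" " "]
puts [llength $pieces]                        ;# 3: a, {}, b

# -not inverts the match; -all -inline returns every surviving element.
puts [lsearch -inline -all -not $pieces {}]   ;# a b
======

Empty strings that end up inside a quoted or braced word are not affected, because by then they have been appended to `chunk` and are no longer separate list elements.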

----

[aspect]: forgive my foolishness, but what is `cmdSplit` for?  From the
description it sounds like `cmdSplit $command` means `lrange $command 1
end` but it seems to do something different.  If you want the elements of
`$command` as a list, just use `$command`!

[AMG]: `cmdSplit` splits an arbitrary string by whitespace, then attempts
to join the pieces according to the result of `[info complete]`.  This results
in a list in which each element embeds its original quote characters.  Since an
odd number of trailing backslashes doesn't cause `[info complete]` to return
false, `cmdSplit` doesn't correctly recognize backslashes used to quote
spaces.

I agree that `cmdSplit` doesn't appear to serve a useful purpose.  Its
input should already be a valid, directly usable list.

[aspect]: it also does strange things if there are consecutive spaces in the
input.  "each element embeds its original quote characters" seems to be the
important characteristic, but I can't think of a use-case where this would be
desirable.  Hopefully [Sarnold] can elaborate on his original intention so
the example can be focussed (and corrected?).

[PYK] 2014-09-11:  Due to [dodekalogue%|%command substitution], several words in the "raw" list that composes a command might contribute to one word in the "logical" list that is that command. `cmdSplit` parses the command into its logical words.
Keeping the braces and quotation marks allows the consumer to know how Tcl
would have dealt with each word.
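
A small sketch of the "raw" versus "logical" distinction, using a made-up command.  Treated as a plain list, the command has five elements, but `[info complete]` shows that the last three only make sense together:

======
set cmd {set x [lindex {a b} 0]}

# As a raw list, the bracketed command substitution is torn apart.
puts [llength $cmd]                            ;# 5
puts [lindex $cmd 2]                           ;# [lindex

# The fragment is incomplete until its closing bracket is seen.
puts [info complete "\[lindex\n"]              ;# 0
puts [info complete "\[lindex {a b} 0\]\n"]    ;# 1
======

So the logical view of this command has three words: `set`, `x` and `[lindex {a b} 0]`.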

To eliminate false positives, I added a `\n` to the `[info complete]` test in `cmdSplit`.  Also, here is another variant that differs only in style:

======
proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set cmdwords {}
    set realword {}
    foreach word [split $cmd " \t"] {
        set realword [concat $realword[set realword {}] $word]
        if {[info complete $realword\n]} {
            lappend cmdwords $realword
            set realword {}
        }
    }
    return $cmdwords
}
======

example:

======
% cmdSplit2 {set "var one" [lindex {one "two three" four} 1]} 
#-> set {"var one"} {[lindex {one "two three" four} 1]}
======

[aspect]:  that example made it clear!  I've wanted something like this before.  It could have potential for some interesting combinations with [Scripted list] and [Annotating words for better specifying procs in Tcl9].

I added an [lsearch] at the end of cmdSplit to get rid of the "empty word" artifacts caused by consecutive whitespace.  [PYK]'s implementation needs a bit of tweaking to handle this better:

======
% cmdSplit {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo  bar}} {"$baz   quz 23"} {lel\ lal} lka {${foo b  bar}}
% cmdSplit2 {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo bar}} {} {"$baz quz 23"} {} {} {lel\ lal} lka {${foo b bar}}
======

[PYK] 2014-09-12:  Yes, it does need some tweaking. In addition to the issue
noted, both `cmdSplit` and the previous `cmdSplit2` improperly converted tab
characters within a word into spaces.  To fix that, it's necessary to use
`[regexp]` instead of `[split]` to get a handle on the actual delimiter.  Here is a new `cmdSplit2` that I think works correctly:

======
proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set words {}
    set logical {}
    set cmd [string trimleft $cmd[set cmd {}]]
    while {[regexp {([^\s]*)(\s+)(.*)} $cmd full first delim last]} {
        append logical $first
        if {[info complete $logical\n]} {
            lappend words $logical
            set logical {}
        } else {
            append logical $delim
        }
        set cmd $last[set last {}]
    }
    if {$cmd ne {}} {
        append logical $cmd
    }
    if {$logical ne {}} {
        lappend words $logical 
    }
    return $words
}
======
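
The reason for switching from `[split]` to `[regexp]` can be seen in a short sketch.  The input string is made up for illustration:

======
set cmd "puts \"a\tb\""          ;# the quoted word contains a real tab

# [split] consumes the delimiter, so the tab cannot be restored later.
set pieces [split $cmd " \t"]
puts [llength $pieces]                        ;# 3
puts [expr {[join $pieces " "] eq $cmd}]      ;# 0: the tab became a space

# [regexp] captures the delimiter, so it can be appended back verbatim.
regexp {([^\s]*)(\s+)(.*)} "\"a\tb\"" -> first delim rest
puts [expr {$delim eq "\t"}]                  ;# 1
puts [expr {"$first$delim$rest" eq "\"a\tb\""}]   ;# 1
======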


** `wordsplit` **

[PYK] 2014-10-07: `wordsplit` accepts a single word and splits it into its
components.  This implementation attempts to split a word exactly as Tcl would,
minding details such as exactly what Tcl considers whitespace, and only
interpreting `\<newline>whitespace` in a braced word specially if there is an
odd number of backslashes preceding the newline character.
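
The odd/even rule can be checked against Tcl itself.  A minimal sketch, using made-up braced words built with `[eval]` so that the backslashes and newlines are explicit:

======
# One backslash before the newline: Tcl replaces the backslash, the
# newline and the following spaces with a single space, even in braces.
eval "set odd {a\\\n  b}"
puts [string length $odd]                        ;# 3, i.e. "a b"

# Two backslashes form an ordinary escaped pair, so the newline that
# follows is plain data and nothing is replaced.
eval "set even {a\\\\\n  b}"
puts [string length $even]                       ;# 7

# The same parity test that the wordsplit code applies to a backslash run.
puts [expr {[regexp -all {\\} "\\\n"] % 2}]      ;# 1
puts [expr {[regexp -all {\\} "\\\\\n"] % 2}]    ;# 0
======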

`wordsplit` is a little more complicated than the other scripts on this page
because it doesn't get as much mileage out of `[info complete]`.

[aspect] has also recently [http://paste.tclers.tk/3304%|%produced an
implementation], but it fails a good number of the tests developed for the
implementation below.

======
#sl is "scripted list", http://wiki.tcl.tk/39972
proc wordsplit word {
    set parts {}
    set first [string index $word 0]
    if {$first in {\" \{}} {
        set last [string index $word end]
        set wantlast [dict get {\" \" \{ \}} $first]
        if {$last ne $wantlast} {
            error [list [list missing trailing [
                dict get {\" quote \{ brace} $first]]]
        }
        set word [string range $word[set word {}] 1 end-1]
    }
    if {$first eq "\{"} {
        set obracecount 0
        set cbracecount 0
        set part {}
        while {$word ne {}} {
            switch -regexp -matchvar rematch $word [sl {
                #count unescaped braces so that balance can be checked below
                {^([{}])(.*)} {
                    if {[string index $word 0] eq "\{"} {
                        incr obracecount
                    } else {
                        incr cbracecount
                    }
                    lassign $rematch -> 1 word
                    append part $1
                }
                #escaped braces are copied through without being counted
                {^(\\[{}])(.*)} {
                    lassign $rematch -> 1 word
                    append part $1
                }
                #these seem to be the only characters Tcl accepts as whitespace
                #in this context
                {^(\\+\n[\x0a\x0b\x0f\x20]*)(.*)}  {
                    lassign $rematch -> 1 word
                    if {[regexp -all {\\} $1] % 2} {
                        if {$part ne {}} {
                            lappend parts $part
                            set part {}
                        }
                        lappend parts $1
                    } else {
                        append part $1
                    }
                }
                {^(.+?(?=(\\?[{}])|(\\+\n)|$))(.*$)} {
                    lassign $rematch -> 1 word
                    append part $1
                } 
                default {
                    error [list {no match} $word]
                }
            }]
        }
        if {$cbracecount != $obracecount} {
            error [list {unbalanced braces in braced word}]
        }
        if {$part ne {}} {
            lappend parts $part
        }
        return $parts
    } else {
        while {$word ne {}} {
            set part {}
            switch -regexp -matchvar rematch $word [sl {
                #order matters in some cases below

                {^(\$(?:::|[A-Za-z0-9])+\()(.*)} - 
                {^(\[)(.*)} {
                    if {[string index $word 0] eq {$}} {
                        set re {^([^)]+\))(.*)}
                        set errmsg {incomplete variable name}
                    } else {
                        set re {^([^]]*])(.*)}
                        set errmsg {incomplete command}
                    }
                    lassign $rematch -> 1 word
                    while {$word ne {}} {
                        set part {}
                        regexp $re $word -> part word
                        append 1 $part
                        if {[info complete $1]}  {
                            lappend parts $1
                            break
                        } elseif {$word eq {}} {
                            error [list $errmsg $1] 
                        }
                    }
                }

                #these seem to be the only characters Tcl accepts as whitespace
                #in this context
                {^(\\\n[\x0a\x0b\x0f\x20]*)(.*)} -
                {^(\$(?:::|[A-Za-z0-9])+)(.*)} -
                {^(\$\{[^\}]*\})(.*)} -
                #detect a single remaining backslash or dollar character here
                #to avoid a more complicated re below
                {^(\\|\$)($)} -
                {^(\\[0-7]{3})(.*)} -
                {^(\\U[0-9a-f]{8})(.*)} -
                {^(\\u[0-9a-f]{4})(.*)} -
                {^(\\x[0-9a-f]{2})(.*)} -
                {^(\\.)(.*)} -
                #lookahead ensures that .+ matches non-special occurrences of
                #"$" character
                #non greedy match here, so make sure .*$ stretches the match to
                #the end, so that something ends up in $2
                {(?x)
                    #non-greedy so that the following lookahead stops it at the
                    #first chance 
                    ^(.+?
                        #stop at any backslashes
                        (?=(\\
                            #but only if they aren't at the end of the word 
                            (?!$))
                        #also stop at brackets
                        |(\[)
                        #and stop at variables
                        |(\$[\{A-Za-z0-9])
                        #or at the end of the word
                        |$)
                    )
                    #the rest of the word. $ stretches the non-greedy re out to
                    #the end of the word
                    (.*$)} {

                    lassign $rematch -> 1 word 
                    lappend parts $1
                } 
                default {
                    error [list {no match} $word]
                }
            }]
        }
    }
    return $parts
}
======
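
The core technique in `wordsplit` is to peel one recognizable piece off the front of the word on each pass through a `switch -regexp -matchvar`.  Here is a much-reduced sketch of that loop, handling only simple variables, backslash pairs and plain text.  The proc name `peel` and its three patterns are invented for illustration and do not reproduce `wordsplit`'s rules:

======
proc peel word {
    set parts {}
    while {$word ne {}} {
        switch -regexp -matchvar m -- $word {
            {^(\$[A-Za-z0-9_]+)(.*)$} -
            {^(\\.)(.*)$} -
            {^([^$\\]+)(.*)$} {
                # m holds: full match, the piece, the rest of the word
                lassign $m -> part word
                lappend parts $part
            }
            default {
                # a lone "$" or "\" that starts nothing: keep it and stop
                lappend parts $word
                set word {}
            }
        }
    }
    return $parts
}

set parts [peel {a$b\nc}]
puts [llength $parts]      ;# 4
puts [lindex $parts 1]     ;# $b
puts [lindex $parts 2]     ;# \n
======

Each pattern captures the piece in its first group and the remainder in its second, so the loop always makes progress or falls through to `default`.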


<<categories>> Parsing | Object Orientation