'''`scriptSplit`''' (formerly known as '''`cmdSplit`'''), by [dgp], parses a [script] into its constituent commands while properly handling semicolon-delimited commands and the "semicolon in a comment" problem. It was written to support parsing of class bodies in an itcl-like, pure Tcl, OO framework into Tcl commands.

[PYK] 2014-10-08: '''`scriptSplit`''' was previously named '''`cmdSplit`''', and '''`cmdSplit`''' was previously named '''`wordSplit`'''. They were recently renamed in order to provide a set of commands whose names are consistent and descriptive. Check the history of this page for some discussion.

** See Also **

   [cmdStream]:

   [Config file using slave interp], by [AMG]:   more-or-less the same thing, implemented using a slave [interp]reter

** Description **

`scriptSplit` returns a list of the commands in a script. The original post is ''[http://groups.google.com/group/comp.lang.tcl/msg/cfe2d00fc7b291be%|%How to split a string into elements exactly as eval would do Options, comp.lang.tcl, 1998-09-07]''.

[PYK] 2013-04-14: I've modified `scriptSplit` to not filter out comments, and provided a simple helper script that does that if desired:

======
nocomments [scriptSplit $script]
======

code:

======
proc scriptSplit {script} {
    set commands {}
    set chunk {}
    foreach line [split $script \n] {
        append chunk $line
        if {[info complete $chunk\n]} {
            # $chunk ends in a complete Tcl command, and none of the
            # newlines within it end a complete Tcl command. If there
            # are multiple Tcl commands in $chunk, they must be
            # separated by semi-colons.
            set cmd {}
            foreach part [split $chunk \;] {
                append cmd $part
                if {[info complete $cmd\n]} {
                    set cmd [string trimleft $cmd]
                    #drop empty commands
                    if {$cmd eq {}} {
                        continue
                    }
                    if {[string match \#* $cmd]} {
                        #the semi-colon was part of a comment. Add it back
                        append cmd \;
                    } else {
                        lappend commands $cmd
                        set cmd {}
                    }
                } else {
                    # No complete command yet.
                    # Replace semicolon and continue
                    append cmd \;
                }
            }
            #if there was an "inline" comment, it will be in cmd, with an
            #additional semicolon at the end
            if {$cmd ne {}} {
                lappend commands [string replace $cmd[set cmd {}] end end]
            }
            set chunk {}
        } else {
            # No end of command yet. Put the newline back and continue
            append chunk \n
        }
    }
    if {![string match {} [string trimright $chunk]]} {
        return -code error "Can't parse script into a\
                sequence of commands.\n\tIncomplete\
                command:\n-----\n$chunk\n-----"
    }
    return $commands
}

proc nocomments {commands} {
    set res [list]
    foreach command $commands {
        if {![string match \#* $command]} {
            lappend res $command
        }
    }
    return $res
}
======

** cmdSplit **

[Sarnold]: `cmdSplit` takes a command and returns its arguments as a list.

======
proc cmdSplit command {
    if {![info complete $command]} {error "non complete command"}
    set res ""; # the list of words
    set chunk ""
    foreach word [split $command " \t"] {
        # testing each word until the word being tested makes the
        # command up to it complete
        # example:
        # set "a b"
        # set -> complete, 1 word
        # set "a -> not complete
        # set "a b" -> complete, 2 words
        append chunk $word
        if {[info complete "$res $chunk\n"]} {
            lappend res $chunk
            set chunk ""
        } else {
            append chunk " "
        }
    }
    lsearch -inline -all -not $res {} ;# empty words denote consecutive whitespace
}
======

----

[aspect]: forgive my foolishness, but what is `cmdSplit` for? From the description it sounds like `cmdSplit $command` means `lrange $command 1 end` but it seems to do something different. If you want the elements of `$command` as a list, just use `$command`!

[AMG]: `cmdSplit` splits an arbitrary string by whitespace, then attempts to join the pieces according to the result of `[info complete]`. This results in a list in which each element embeds its original quote characters. Since an odd number of trailing backslashes doesn't cause `[info complete]` to return false, `cmdSplit` doesn't correctly recognize backslashes used to quote spaces.
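
The behaviour of `[info complete]` with a trailing backslash can be checked directly, which also suggests why the listings on this page append `\n` before calling it. This is only a minimal sketch, assuming a stock Tcl 8.5 or later interpreter; the variable name `word` is purely illustrative:

======
# A word that ends in a single trailing backslash: l e l \
set word "lel\\"

# Without a trailing newline the lone backslash is taken literally,
# so the string counts as complete.
puts [info complete $word]      ;# 1

# With a newline appended, the backslash-newline pair reads as an
# unfinished line continuation, so the string counts as incomplete.
puts [info complete $word\n]    ;# 0
======
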
I agree that `cmdSplit` doesn't appear to serve a useful purpose. Its input should already be a valid, directly usable list.

[aspect]: it also does strange things if there are consecutive spaces in the input. "each element embeds its original quote characters" seems to be the important characteristic, but I can't think of a use-case where this would be desirable. I'm hoping that [Sarnold] can elaborate on his original intention so the example can be focussed (and corrected?).

[PYK] 2014-09-11: Due to [dodekalogue%|%command substitution], several words in the "raw" list that composes a command might contribute to one word in the "logical" list that is that command. `cmdSplit` parses the command into its logical words. Keeping the braces and quotation marks allows the consumer to know how Tcl would have dealt with each word. To eliminate false positives, I added a `\n` to the `[info complete..]` test in `cmdSplit`. Also, here is another variant that is different only in style:

======
proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set cmdwords {}
    set realword {}
    foreach word [split $cmd " \t"] {
        set realword [concat $realword[set realword {}] $word]
        if {[info complete $realword\n]} {
            lappend cmdwords $realword
            set realword {}
        }
    }
    return $cmdwords
}
======

example:

======
% cmdSplit2 {set "var one" [lindex {one "two three" four} 1]}
#-> set {"var one"} {[lindex {one "two three" four} 1]}
======

[aspect]: that example made it clear! I've wanted something like this before. It could have potential for some interesting combinations with [Scripted list] and [Annotating words for better specifying procs in Tcl9].

I added an [lsearch] at the end of cmdSplit to get rid of the "empty word" artifacts caused by consecutive whitespace.
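
The empty words come from `[split]` itself, which returns an empty element between each pair of adjacent delimiters. A small sketch of the artifact and of the `[lsearch]` filter, with the variable name `raw` chosen only for illustration:

======
# a, tab, space, b: two adjacent delimiters sit between the two words
set raw [split "a\t b" " \t"]
puts $raw                                   ;# a {} b

# -not inverts the match, so only the non-empty elements are returned
puts [lsearch -inline -all -not $raw {}]    ;# a b
======
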
[PYK]'s implementation needs a bit of tweaking to handle this better:

======
% cmdSplit {{foo bar}  "$baz quz 23"   lel\ lal lka ${foo b bar}}
{{foo bar}} {"$baz quz 23"} {lel\ lal} lka {${foo b bar}}
% cmdSplit2 {{foo bar}  "$baz quz 23"   lel\ lal lka ${foo b bar}}
{{foo bar}} {} {"$baz quz 23"} {} {} {lel\ lal} lka {${foo b bar}}
======

[PYK] 2014-09-12: Yes, it does need some tweaking. In addition to the issue noted, both `cmdSplit` and the previous `cmdSplit2` improperly converted tab characters within a word into spaces. To fix that, it's necessary to use `[regexp]` instead of `[split]` to get a handle on the actual delimiter. Here is a new `cmdSplit2` that I think works correctly:

======
proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set words {}
    set logical {}
    set cmd [string trimleft $cmd[set cmd {}]]
    while {[regexp {([^\s]*)(\s+)(.*)} $cmd full first delim last]} {
        append logical $first
        if {[info complete $logical\n]} {
            lappend words $logical
            set logical {}
        } else {
            append logical $delim
        }
        set cmd $last[set last {}]
    }
    if {$cmd ne {}} {
        append logical $cmd
    }
    if {$logical ne {}} {
        lappend words $logical
    }
    return $words
}
======

** `wordsplit` **

[PYK] 2014-10-07: `wordsplit` accepts a single word and splits it into its components. This implementation attempts to split a word exactly as Tcl would, minding details such as just what exactly Tcl considers whitespace, and only interpreting `\whitespace` in a braced word specially if there is an odd number of backslashes preceding the newline character. `wordsplit` is a little more complicated than the other scripts on this page because it doesn't get as much mileage out of `[info complete]`. [aspect] has also recently [http://paste.tclers.tk/3304%|%produced an implementation], but it fails a good number of the tests developed for the implementation below.
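
The odd-versus-even backslash rule is Tcl's own treatment of backslash-newline inside braces, and it can be observed without `wordsplit`. A minimal sketch, with the variable names `odd` and `even` chosen only for illustration:

======
# One backslash before the newline: the backslash-newline pair is
# replaced by a single space, giving "a b"
set odd {a\
b}
puts [string length $odd]     ;# 3

# Two backslashes before the newline: both backslashes and the
# newline are kept literally
set even {a\\
b}
puts [string length $even]    ;# 5
======
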
======
#sl is "scripted list", http://wiki.tcl.tk/39972
proc wordsplit word {
    set parts {}
    set first [string index $word 0]
    if {$first in {\" \{}} {
        set last [string index $word end]
        set wantlast [dict get {\" \" \{ \}} $first]
        if {$last ne $wantlast} {
            error [list [list missing trailing [
                dict get {\" quote \{ brace} $first]]]
        }
        set word [string range $word[set word {}] 1 end-1]
    }
    if {$first eq "\{"} {
        set obracecount 0
        set cbracecount 0
        set part {}
        while {$word ne {}} {
            switch -regexp -matchvar rematch $word [sl {
                #these seem to be the only characters Tcl accepts as whitespace
                #in this context
                {^([{}])(.*)} {
                    if {[string index $word 0] eq "\{"} {
                        incr obracecount
                    } else {
                        incr cbracecount
                    }
                    lassign $rematch -> 1 word
                    append part $1
                }
                {^(\\[{}])(.*)} {
                    lassign $rematch -> 1 word
                    append part $1
                }
                {^(\\+\n[\x0a\x0b\x0f\x20]*)(.*)} {
                    lassign $rematch -> 1 word
                    if {[regexp -all {\\} $1] % 2} {
                        if {$part ne {}} {
                            lappend parts $part
                            set part {}
                        }
                        lappend parts $1
                    } else {
                        append part $1
                    }
                }
                {^(.+?(?=(\\?[{}])|(\\+\n)|$))(.*$)} {
                    lassign $rematch -> 1 word
                    append part $1
                }
                default {
                    error [list {no match} $word]
                }
            }]
        }
        if {$cbracecount != $obracecount} {
            error [list {unbalanced braces in braced word}]
        }
        if {$part ne {}} {
            lappend parts $part
        }
        return $parts
    } else {
        while {$word ne {}} {
            set part {}
            switch -regexp -matchvar rematch $word [sl {
                #order matters in some cases below

                {^(\$(?:::|[A-Za-z0-9])+\()(.*)} -
                {^(\[)(.*)} {
                    if {[string index $word 0] eq {$}} {
                        set re {^([^)]+\))(.*)}
                        set errmsg {incomplete variable name}
                    } else {
                        set re {^([^]]*])(.*)}
                        set errmsg {incomplete command}
                    }
                    lassign $rematch -> 1 word
                    while {$word ne {}} {
                        set part {}
                        regexp $re $word -> part word
                        append 1 $part
                        if {[info complete $1]} {
                            lappend parts $1
                            break
                        } elseif {$word eq {}} {
                            error [list $errmsg $1]
                        }
                    }
                }
                #these seem to be the only characters Tcl accepts as whitespace
                #in this context
                {^(\\\n[\x0a\x0b\x0f\x20]*)(.*)} -
                {^(\$(?:::|[A-Za-z0-9])+)(.*)} -
                {^(\$\{[^\}]*\})(.*)} -
                #detect a single remaining backslash or dollar character here
                #to avoid a more complicated re below
                {^(\\|\$)($)} -
                {^(\\[0-7]{3})(.*)} -
                {^(\\U[0-9a-f]{8})(.*)} -
                {^(\\u[0-9a-f]{4})(.*)} -
                {^(\\x[0-9a-f]{2})(.*)} -
                {^(\\.)(.*)} -
                #lookahead ensures that .+ matches non-special occurrences of
                #"$" character
                #non greedy match here, so make sure .*$ stretches the match to
                #the end, so that something ends up in $2
                {(?x)
                    #non-greedy so that the following lookahead stops it at the
                    #first chance
                    ^(.+?
                    #stop at any backslashes
                    (?=(\\
                        #but only if they aren't at the end of the word
                        (?!$))
                    #also stop at brackets
                    |(\[)
                    #and stop at variables
                    |(\$[\{A-Za-z0-9])
                    #or at the end of the word
                    |$)
                    )
                    #the rest of the word. $ stretches the non-greedy re out to
                    #the end of the word
                    (.*$)} {
                    lassign $rematch -> 1 word
                    lappend parts $1
                }
                default {
                    error [list {no match} $word]
                }
            }]
        }
    }
    return $parts
}
======

<<categories>> Parsing | Object Orientation