Version 42 of split

Updated 2005-12-17 00:36:08

split - Split a string into a proper Tcl list

 split string ?splitChars? 

Returns a list created by splitting string at each character that is in the splitChars argument. Each element of the result list will consist of the characters from string that lie between instances of the characters in splitChars. Empty list elements will be generated if string contains adjacent characters in splitChars, or if the first or last character of string is in splitChars. If splitChars is an empty string then each character of string becomes a separate element of the result list. SplitChars defaults to the standard white-space characters. For example,

 split "comp.unix.misc" .

returns "comp unix misc" and

 split "Hello world" {}

returns "H e l l o { } w o r l d".
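A few more cases, as a minimal sketch of the rules above. The sample strings are made up for illustration; the results are shown in the comments.

```tcl
# Adjacent split characters produce an empty element between them.
set r1 [split "a,,b" ,]        ;# a {} b
# A split character at the start or end produces an empty element there too.
set r2 [split ",a," ,]         ;# {} a {}
# Every character in splitChars is a separator in its own right.
set r3 [split "a:b,c" ":,"]    ;# a b c
puts [list $r1 $r2 $r3]
```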

[DKF]: I believe there's a standard (ANSI? POSIX?) somewhere.  
But the answer includes "space", "tab", and "newline".
[escargo]: By "tab" do you mean both horizontal tab (ASCII 9) and vertical tab
(ASCII 11)?  (See http://www.asciitable.com/)  Arguments could be made for most
of the ASCII characters under 33.
[Strick]: Let's ask Tcl what it thinks are white:

 $ env | grep en_
 LANG=en_US.UTF-8
 $ cat what-chars-does-split-think-are-white.tcl
 for {set i 0} {$i<65536} {incr i} {
   if {[llength [format "/%c/" $i]] > 1} { puts -nonewline "$i " }
 }
 $ tclsh what-chars-does-split-think-are-white.tcl
 9 10 11 12 13 32 $
[escargo] 4 Jan 2005:
9 = ASCII TAB, 10 = ASCII LF (line feed), 11 = ASCII VT (vertical tab), 12 = ASCII FF (form feed),
13 = ASCII CR (carriage return), and of course 32 = ASCII Space.
I would have thought that the separator characters would count as white space (28-31, FS, GS, RS, US),
but I guess they are regarded as "nonprinting" characters.

[Strick]: Oops, I forgot to actually use split in my script above.
So now I test four different notions of white, and get three different answers.
I understand why Tcl's builtin list-splitting rules must be fixed, regardless of locale.
But it seems 'split' should use either the list-splitting rule or the 'string is space' rule;
instead it uses its own (pre-Unicode?) rule:
 $ cat what-chars-does-split-think-are-white.tcl
 puts "tcl=[info patch] LANG=$env(LANG)"
 puts -nonewline "according to llength: "
 for {set i 0} {$i<65536} {incr i} {
   if {[llength [format "/%c/" $i]] > 1} { puts -nonewline "$i " }
 }
 puts ""
 puts -nonewline "according to split: "
 for {set i 0} {$i<65536} {incr i} {
   if {[llength [split [format "/%c/" $i]]] > 1} { puts -nonewline "$i " }
 }
 puts ""
 puts -nonewline "according to 'string is space': "
 for {set i 0} {$i<65536} {incr i} {
   if {[string is space [format "%c" $i]]} { puts -nonewline "$i " }
 }
 puts ""
 puts -nonewline "according to regexp {\\s}: "
 for {set i 0} {$i<65536} {incr i} {
   if {[regexp {\s} [format "%c" $i]]} { puts -nonewline "$i " }
 }
 puts ""
 $
 $ tclsh what-chars-does-split-think-are-white.tcl
 tcl=8.4.7 LANG=en_US.UTF-8
 according to llength: 9 10 11 12 13 32
 according to split: 9 10 13 32
 according to 'string is space': 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
 according to regexp {\s}: 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
 $
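For a quick check of a single character instead of the whole range, the first three tests can be wrapped in a small helper. This is only a sketch: `whiteReport` is a hypothetical name, not a Tcl built-in. Going by the Tcl 8.4.7 run shown above, vertical tab (11) would be reported as white by list parsing and by `string is space`, but not by `split`.

```tcl
# Hypothetical helper: report how three notions of "white" classify code point $i.
proc whiteReport {i} {
    set c [format %c $i]
    list \
        llength [expr {[llength "/$c/"] > 1}] \
        split   [expr {[llength [split "/$c/"]] > 1}] \
        space   [string is space $c]
}
puts [whiteReport 32]   ;# a plain space: llength 1 split 1 space 1
puts [whiteReport 11]   ;# vertical tab: compare with the output above
```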
----
Note that the argument named ''splitChars'' above is a set of zero or more individual characters, each of which acts as a separator on its own.
If you want to split on a specific sequence of two or more characters, or on a regular expression, split will not work for you.
See [Tcllib]'s [textutil]::split::splitx for that functionality.
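To see the difference, here is a small sketch (the sample string is borrowed from the example further down this page) of what happens when a two-character separator is passed as splitChars. Each of `<` and `>` splits on its own, so an empty element appears between them.

```tcl
# "<>" is treated as the set of characters {<, >}, not as the sequence "<>".
set parts [split "This<>is<>a<>test." "<>"]
puts $parts             ;# This {} is {} a {} test.
puts [llength $parts]   ;# 7
```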

[SS] 2004/01/31 - or you can use the following function:
 proc wsplit {string sep} {
     set first [string first $sep $string]
     if {$first == -1} {
         return [list $string]
     } else {
         set l [string length $sep]
         set left [string range $string 0 [expr {$first-1}]]
         set right [string range $string [expr {$first+$l}] end]
         return [concat [list $left] [wsplit $right $sep]]
     }
 }

This version is recursive, so it may be better to rewrite it iteratively if you plan to use it
on very long strings with many separators. The difference between wsplit and splitx
is that splitx treats its separator as a regular expression, so a separator that is not known
in advance may cause problems if it contains regexp metacharacters.
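One way to sidestep that problem, sketched here under the assumption that the separator should always be taken literally, is to backslash-quote every non-word character before handing it to a regexp-based splitter. `quoteRegexp` is a hypothetical helper name, not part of Tcl or Tcllib.

```tcl
# Hypothetical helper: make $str safe to use as a literal inside a regular
# expression by putting a backslash in front of every non-word character.
proc quoteRegexp {str} {
    regsub -all {\W} $str {\\&} quoted
    return $quoted
}

set sep "."
puts [regexp -all -- $sep "a.b.c"]                ;# 5: an unquoted "." matches any character
puts [regexp -all -- [quoteRegexp $sep] "a.b.c"]  ;# 2: only the literal dots match
```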
[IL] 2005-01-03: on the near anniversary of this proc, the iterative version,
quick n dirty since I'm in a hurry to parse some html...
 proc wsplit { str sepStr } {
     set strList   [list]
     set sepLength [string length $sepStr]
     while { [set index [string first $sepStr $str]] != -1 } {
         # Everything before the separator is the next element.
         lappend strList [string range $str 0 [expr {$index - 1}]]
         # Continue with the text just after the separator.
         set str [string range $str [expr {$index + $sepLength}] end]
     }
     # Whatever remains after the last separator is the final element.
     lappend strList $str
     return $strList
 }

Hmm, use this version instead; the string first version doesn't catch separator strings
connected to the ones you want:

 proc wsplit { str sepStr } {

    # Note: sepStr is used as a regular expression throughout this proc.
    if { ![regexp -- $sepStr $str] } { return [list $str] }

    set strList   [list]

    set pattern "(.*?)$sepStr"
    while { [regexp $pattern $str match left] } {

        lappend strList $left
        regsub $pattern $str "" str
    }

    lappend strList $str

    return $strList
 }

[RS] writes:

Note that wsplit can be done more simply:

  1. map the separator string to a single char that cannot occur in the string (here \0)
  2. split on that single char

 proc wsplit {str sep} {
   split [string map [list $sep \0] $str] \0
 }
 % wsplit This<>is<>a<>test. <>
 This is a test.


[escargo]: So what should you use when you don't care how many spaces were between tokens, you just want the nonblank tokens in the list and none of the separators?

[RS]: Easy, just use a filter:

 proc filter {cond list} {
    set res {}
    foreach element $list {if [$cond $element] {lappend res $element}}
    set res
 }
 % filter llength [split "a   list   with many   spaces"]
 a list with many spaces

... or use

 % split [regsub -all {[ \t\n]+} "a   list   with many   spaces" { }] 

... or use

 % lreplace "a   list   with many   spaces" 0 -1
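Another option, offered as a sketch rather than a recommendation, is to collect the runs of non-whitespace directly with `regexp -all -inline`. Unlike the `lreplace` trick, this does not require the input to be a well-formed Tcl list, and leading or trailing whitespace produces no empty elements. Note that `\S` uses the regexp notion of whitespace, which (as the test output above shows) is wider than the one `split` uses.

```tcl
# Sample input with leading, trailing and repeated blanks (made up for illustration).
set line "  a   list   with many   spaces "
# Each match of \S+ is one maximal run of non-whitespace characters.
set tokens [regexp -all -inline {\S+} $line]
puts $tokens            ;# a list with many spaces
puts [llength $tokens]  ;# 5
```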

See Counting characters in a string where split was pretty good...


See also the man page at http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/split.htm .


Splitting strings with embedded strings - list

