split

'''[http://www.tcl.tk/man/tcl/TclCmd/split.htm%|%split]''', a [Tcl Commands%|%built-in] [Tcl] command, splits a [string] into a [list]



** Synopsis **

    :   '''split''' ''string'' ?''splitChars''?



** Documentation **

   [http://www.tcl.tk/man/tcl8.5/TclCmd/split.htm%|%official reference]:   



** Description **
''SplitChars'' defaults to the standard white-space characters.
`split` returns a list of substrings of ''string'' that are delimited by any
characters in ''splitChars'', which is a sequence of characters, not a list.
If characters in ''splitChars'' occur adjacently in ''string'', the result will
include an empty string between those adjacent characters.  If the first
character of ''string'' is in ''splitChars'', the first item in the result is
the empty string at the beginning of ''string'', and if the last character of
''string'' is in ''splitChars'', the last item in the result is the empty
string after the end of ''string''.
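
A short illustration of these rules (the variable name `fields` is only for this sketch):

======
# Leading, trailing, and adjacent separators each yield an empty element.
set fields [split ,a,,b, ,]
puts $fields            ;# {} a {} b {}
puts [llength $fields]  ;# 5

# splitChars is a set of characters, not a substring: either - or _ splits.
puts [split a-b_c -_]   ;# a b c
======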


** See Also **

   [join]:   

   [list]:   

   [Additional string functions]:   

   [Arts and crafts of Tcl-Tk programming]:   

   [ycl]:   provides `string delimit`, which splits strings on `[string match]`-style patterns or `[regexp]`-style patterns

   [Splitting strings with embedded strings]:   

   [Splitting a string on arbitrary substrings]:   

   [Counting characters in a string]:   where [split] was pretty good...

   [cmdSplit]:   to split as Tcl would split a command into words prior to evaluation.

** Examples **

*** Splitting by dots ***
======
split comp.unix.misc .
======

======none
comp unix misc
======

*** Splitting into characters ***

======
split {Hello world} {}
======

======none
H e l l o { } w o r l d
======

Splitting on the empty string is an optimized case, and is an efficient
operation.

*** Splitting by whitespace: the pitfalls ***

======
split { abc def  ghi}
======
======none
{} abc def {} ghi
======

Usually, if you are splitting by whitespace and do not want those blank fields,
you are better off doing:

======
regexp -all -inline {\S+} { abc def  ghi}
======

======none
abc def ghi
======



** Definition of White-Space Characters **

[ulis]: where in the doc are defined the standard white-space characters?

[DKF]: I believe there's a standard (ANSI? POSIX?) somewhere.  But the answer
includes "space", "tab", and "newline".

[escargo]: By "tab" do you mean both horizontal tab ([ASCII] 9) and vertical tab
([ASCII] 11)?  (See http://www.asciitable.com/)  Arguments could be made for most
of the ASCII characters under 33.

[Strick]: Let's ask Tcl what it thinks are white:

======
$ env | grep en_
LANG=en_US.UTF-8
$ cat what-chars-does-split-think-are-white.tcl
for {set i 0} {$i < 65536} {incr i} {
    if {[llength [format "/%c/" $i]] > 1} {
        puts -nonewline "$i "
    }
}
$ tclsh what-chars-does-split-think-are-white.tcl
9 10 11 12 13 32 $
======

[escargo] 2005-04-01 :

9 = ASCII TAB, 10 = ASCII LF (line feed), 11 = ASCII VT (vertical tab), 12 =
ASCII FF (form feed), 13 = ASCII CR (carriage return), and of course 32 = ASCII
Space.

I would have thought that the separator characters would count as white space
(28-31, FS, GS, RS, US), but I guess they are regarded as "nonprinting"
characters.

[DKF]: I actually mean "what does `isspace()` think is whitespace". :^)

[Strick]: Oops, I forgot to actually use split in my script above.  So now I
test four different notions of white, and get three different answers.  I
understand why Tcl's builtin list-splitting rules must be fixed, regardless of
locale.  But it seems `split` should use the list-splitting rule or the
`string is space` rule, but it uses its own (pre-Unicode?) rule:

======
$ cat what-chars-does-split-think-are-white.tcl
puts "tcl=[info patch] LANG=$env(LANG)"

puts -nonewline {according to llength: }
for {set i 0} {$i < 65536} {incr i} {
    if {[llength [format "/%c/" $i]] > 1} {
        puts -nonewline "$i "
    }
}
puts {}

puts -nonewline {according to split: }
for {set i 0} {$i < 65536} {incr i} {
    if {[llength [split [format /%c/ $i]]] > 1} {
        puts -nonewline "$i "
    }
}
puts {}

puts -nonewline {according to 'string is space': }
for {set i 0} {$i < 65536} {incr i} {
    if {[string is space [format %c $i]]} { puts -nonewline "$i " }
}
puts {}

puts -nonewline {according to regexp {\s}: }
for {set i 0} {$i < 65536} {incr i} {
    if {[regexp {\s} [format "%c" $i]]} {
        puts -nonewline "$i "
    }
}
puts {}

$
$ tclsh what-chars-does-split-think-are-white.tcl
tcl=8.4.7 LANG=en_US.UTF-8
according to llength: 9 10 11 12 13 32
according to split: 9 10 13 32
according to 'string is space': 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
according to regexp {\s}: 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
$
======

[escargo] 2006-01-27: If split used chars 9 10 11 12 13 32 then there would be
only two sets, with the smaller set as a proper subset of the larger set.  The
two characters that would have to be added are the vertical tab and form feed.
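
If a script needs `split` to treat vertical tab and form feed as white space as well, one option is to pass the characters explicitly as ''splitChars''. A minimal sketch, assuming the six characters reported for `llength` above are the ones wanted (the variable name `ws` is only illustrative):

======
# space, tab, newline, carriage return, vertical tab, form feed
set ws " \t\n\r\v\f"
puts [split "a\vb\fc d" $ws]  ;# a b c d
======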



** Splitting on Substrings **

''splitChars'' is a series of 0 to n individual characters.  However, if you
want to split on a specific sequence of 2 or more characters together, or if
you want to split on a regular expression, split will not work for you.  See
[Tcllib]'s [textutil]::[splitx], or [ycl%|%ycl::string::delimit] for that
functionality.

[SS] 2004-01-31: Or you can use the following function:

======
proc wsplit {string sep} {
    set first [string first $sep $string]
    if {$first == -1} {
        return [list $string]
    } else {
        set l [string length $sep]
        set left [string range $string 0 [expr {$first-1}]]
        set right [string range $string [expr {$first+$l}] end]
        return [concat [list $left] [wsplit $right $sep]]
    }
}
======

This version is recursive, so it may be better to rewrite it if you plan to use
the function against very long strings with many separators. The difference
between wsplit and [splitx] is that [splitx] uses [regexp], so it may create
problems with unknown separators.

[IL] 2005-01-03: on the near anniversary of this proc, the iterative version,
quick-n-dirty since I'm in a hurry to parse some html...

======
proc wsplit { str sepStr } {
   set strList   {}
   set sepLength [string length $sepStr]

   while {[set index [string first $sepStr $str]] != -1} {
       set left [string range $str 0 [expr {$index + $sepLength - 1}]]
       set str  [string range $str [expr {$index + $sepLength + 1}] end]
       lappend strList $left
   }
   return $strList
}
======

Hmm, use this version instead; the `string first` approach doesn't catch separator strings directly connected to the ones you want:

======
proc wsplit {str sepStr} {
    if {![regexp $sepStr $str]} {
        return [list $str]
    }
    set strList {}
    set pattern (.*?)$sepStr
    while {[regexp $pattern $str match left]} {
        lappend strList $left
        regsub $pattern $str {} str
    }
    lappend strList $str
    return $strList
}
======

[RS] writes recently:

Note that the wsplit can be done simpler:

    1. map the separating string to a single char that cannot appear in the string
    2. split on that single char

======
proc wsplit {str sep} {
  split [string map [list $sep \0] $str] \0
}
% wsplit This<>is<>a<>test. <>
This is a test.
======

----

[Sarnold] 2006-06-21: Here is my version of wsplit:

======
proc wsplit {str sep} {
    set out {}
    set sepLen [string length $sep]
    if {$sepLen < 2} {
        return [split $str $sep]
    }
    while {[set idx [string first $sep $str]] >= 0} {
        # the left part : the current element
        lappend out [string range $str 0 [expr {$idx-1}]]
        # get the right part and iterate with it
        set str [string range $str [incr idx $sepLen] end]
    }
    # there is no separator anymore, but keep in mind the right part must be
    # appended
    lappend out $str
}
======

----

[escargo]: So what should you use when you don't care how many spaces were
between tokens, you just want the non-blank tokens in the list and none of the
separators?

[RS]: Easy, just use a filter:

======
proc filter {cond list} {
    set res {}
    foreach element $list {
        if {[$cond $element]} {
            lappend res $element
        }
    }
    set res
}
% filter llength [split {a   list   with many   spaces}]
a list with many spaces
======

... or use

======
% split [regsub -all {[ \t\n]+} {a   list   with many   spaces} { }]
======

to eliminate the excess white space ...

... or use

======
% lreplace {a   list   with many   spaces} 0 -1
======

to force reinterpretation as a list ...


** Example: Printing Binary Data **
[RS] 2006-07-04: The following routine illustrates the use of `[scan]` to display a hex
dump of a value.  When you split a byte array on `{}`, it may be surprising that
the result may contain characters that are unscannable for \x00 bytes.  The
marked line works around this:

======
proc hexdump str {
    set res {}
    foreach c [split $str {}] {
        set i [scan $c %c]
        if {$i eq {}} {set i 0} ;#<--------------------- here
        lappend res [format %02x $i]
    }
    return $res
}
======


** Protecting Separators **

Sometimes you want to be able to insert one of the separators anyway, but still split on all "unprotected" separators. The following procedure will do that.

======
proc psplit { str seps {protector "\\"}} {
    set out [list]
    set prev {}
    set current {}
    foreach c [split $str {}] {
        if { [string first $c $seps] >= 0 } {
            if { $prev eq $protector } {
                set current [string range $current 0 end-1]
                append current $c
            } else {
                lappend out $current
                set current {}
            }
            set prev {}
        } else {
            append current $c
            set prev $c
        }
    }
    if { $current ne {} } {
        lappend out $current
    }

    return $out
}
======

So splitting the string `I intend to use the character \. to separate between sentences. And can demonstrate it!` on `.` would return a list with two elements only:

======
set str {I intend to use the character \. to separate between sentences. And can demonstrate it!}
puts [psplit $str .]
======

would print out

======none
{I intend to use the character . to separate between sentences} { And can demonstrate it!}
======
----
[PYK] 2014-03-02: There was previously a discussion by [escargo] here that made
no sense to me, so I've removed it.  If someone sees the point of that
discussion, please bring it back!

<<categories>> Command | String Processing