Splitting a string on arbitrary substrings

** See Also **
   [ycl%|%ycl::string::delimit%|%]:   Partitions a string into substrings using any combination of literal strings, [string match], or [regular expressions%|%regular expression] arguments.
   `splitre`, from [Tcllib]:   Splits on a substring that may vary (for instance, a sequence of one or more empty lines).
   `sepsplit`, from [Sqawk]:   Partitions a string using a [regular expressions%|%regular expression] and preserves the separators that match it.
   `[ycl%|%ycl string regsplit]`:   Splits a string using a [regular expressions%|%regular expression].


** Content **

[Arjen Markus] 2006-10-08: In the chatroom yesterday [Richard Suchenwirth] came up 
with an elegant solution for the following problem:

Given a string (actually the contents of a file) that contains empty lines (so \n\n),
split it into parts on these empty lines.
You could use `[regexp]` to do this, but a much more elegant way (IMHO) is 
to replace the substring that you want to split on by a character that is not 
present in the original string and then use [split]. The choice of that character
is of course a bit delicate and the method is limited to ''fixed'' substrings.

Still, the code is simple:

======
set list [split [string map [list $substring $splitchar] $string] $splitchar]
======

Now, what character could you choose for $splitchar? One fascinating choice is
'''\u0080''' - it is part of a region of the Unicode character map that is 
more or less forbidden or reserved. It means that it is very unlikely to be
present in the original string (unless that is a binary string of course, 
in which case most if not all bets are off, but splitting binary strings is 
a rare and dangerous thing anyway).

----

[WJP] 2006-08-10: \u0080 is fairly safe but you can't be quite sure since it
is a legal Unicode control character. A better choice is \uFFFE or \uFFFF. 
Both are guaranteed not to be characters and so are absolutely safe.
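
As a minimal sketch of the technique with \uFFFF as the split character: the procedure name `splitsub` and the sample text are made up for illustration, and the guard against a pre-existing marker is an added precaution, not part of the one-liner above.

======
# Hypothetical helper: split $string on the fixed substring $substring.
proc splitsub {string substring} {
    # \uFFFF is a Unicode noncharacter, so it should not occur in normal text.
    set splitchar \uFFFF
    # Refuse to continue if the marker is already present in the input.
    if {[string first $splitchar $string] >= 0} {
        error "split character already present in the string"
    }
    split [string map [list $substring $splitchar] $string] $splitchar
}

# Sample input: three blocks separated by one empty line each.
set text "first paragraph\nline two\n\nsecond paragraph\n\nthird"
set parts [splitsub $text \n\n]
======

Because the substring is fixed, a run of three newlines would leave a stray newline at the start of the following element; separators that vary in length need one of the regexp-based approaches.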

----
[JMN] 2006-11-02:

My timings indicate the above method is about 10x faster than [textutil]::[splitx].
However, Tcl's `[split]` alone on a single-character separator is 4x faster again.

I'd love to see a multi-character 'split' in the core. [string split] perhaps?

[Lars H], 2008-07-18:
Thinking about that same idea, I believe the following syntax may be appropriate for a [split] extended that way:

======none
    :   '''split''' ''text'' ?''string'' ''list'' ...? ?''chars''?
======

The ''text'' is the string to split. Each ''string'' is a (possibly multicharacter) string at which it might be split, and when such a split occurs, the elements of the following ''list'' are appended to the result before processing resumes on the rest of the string. The ''chars'' are split-characters, as in current Tcl.

The idea of the ''list'' arguments is that sometimes you want to split on several substrings, but you also want to know which one it was at each position; current [split] doesn't provide that information. With the above you could do
  split {1+1-3+4-2} + plus - minus
to get
  1 plus 1 minus 3 plus 4 minus 2
or
  split {1+1-3+4-2} + {} - minus
to get
  1 1 minus 3 4 minus 2
However, one could probably use clever combinations of [regexp] -all, [regsub], and/or [string map] to get this effect as well, so it's no great leap in expressive power.
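
One way to get this effect with `[string map]` and `[split]` is sketched below. The name `splitmulti` is made up, the optional ''chars'' argument of the proposal is not handled, and the text is assumed not to contain \uFFFF. Each separator is mapped to its index wrapped in marker characters, so after splitting on the marker, text pieces and separator indices alternate.

======
# Hypothetical emulation of the proposed syntax (without the ?chars? argument).
# Requires Tcl 8.5 or later for the {*} expansion.
proc splitmulti {text args} {
    set mark \uFFFF
    if {[string first $mark $text] >= 0} {
        error "text contains the marker character"
    }
    # Map the k-th separator to <mark>k<mark>.
    set map {}
    set k 0
    foreach {sep items} $args {
        lappend map $sep $mark$k$mark
        incr k
    }
    # After splitting on the marker, elements alternate: piece, index, piece, ...
    set result {}
    foreach {piece k} [split [string map $map $text] $mark] {
        lappend result $piece
        if {$k ne {}} {
            # Append the list that belongs to the separator found here.
            lappend result {*}[lindex $args [expr {2*$k + 1}]]
        }
    }
    return $result
}

set r1 [splitmulti {1+1-3+4-2} + plus - minus]
set r2 [splitmulti {1+1-3+4-2} + {} - minus]
======

With these definitions `r1` holds `1 plus 1 minus 3 plus 4 minus 2` and `r2` holds `1 1 minus 3 4 minus 2`, matching the two examples above. When several separators could match at the same position, `[string map]` uses the one listed first.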



** splitstr **

[PYK] 2016-02-16: 

The following procedure, `splitstr`, splits a string on the [regular expressions]
in `$exprs` and returns a list in which each item is a two-item list: the text
leading up to a match, followed by the text that matched. Concatenating all of
these items yields the original text.


======
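proc splitstr {text exprs} {
    # Nothing to split: an empty string yields an empty list.
    if {$text eq {}} {
        return $text
    }
    # Combine the expressions into a single alternation.
    set exprs [join $exprs |]
    # Group 1: the characters up to the next delimiter or the end of the text.
    # Group 2: the delimiter itself (empty at the end of the text).
    set regexp ((?:(?!$exprs|$).)*)($exprs|$)
    # Each match yields three items (whole match plus both groups); keep the
    # groups. This assumes the expressions contain no capturing groups.
    lmap {x y z} [regexp -all -inline $regexp $text] {list $y $z}
}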
======




<<categories>> String Processing | Parsing