Splitting a string on arbitrary substrings

Arjen Markus 2006-10-08: In the chatroom yesterday Richard Suchenwirth came up with an elegant solution for the following problem:

Given a string (actually the contents of a file) that contains empty lines (so \n\n), split it into parts on these empty lines.

You could use regexp to do this, but a much more elegant way (IMHO) is to replace the substring that you want to split on with a single character that is not present in the original string and then use split on that character. The choice of that character is of course a bit delicate, and the method is limited to fixed substrings.

Still, the code is simple:

set list [split [string map [list $substring $splitchar] $string] $splitchar]

Now, what character could you choose for $splitchar? One fascinating choice is \u0080: it is part of a region of the Unicode character map that is more or less forbidden or reserved. That means it is very unlikely to be present in the original string (unless that is a binary string of course, in which case most if not all bets are off, but splitting binary strings is a rare and dangerous thing anyway).

WJP 2006-08-10: \u0080 is fairly safe but you can't be quite sure since it is a legal Unicode control character. A better choice is \uFFFE or \uFFFF. Both are guaranteed not to be characters and so are absolutely safe.
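
As an illustration of the technique discussed so far, here is a minimal sketch that wraps the one-liner in a procedure. The name splitOnSubstring is hypothetical and not part of Tcl or tcllib. It uses \uFFFF as the placeholder and, because the choice of character is delicate, refuses to run if the placeholder already occurs in the input.

```tcl
# Hypothetical helper: split $string on the fixed substring $substring.
proc splitOnSubstring {string substring} {
    # Placeholder character; assumed not to occur in ordinary text.
    set splitchar \uFFFF
    # If the placeholder is already present, split could not tell it
    # apart from a real separator, so give up instead of guessing.
    if {[string first $splitchar $string] >= 0} {
        error "placeholder character already present in the input"
    }
    return [split [string map [list $substring $splitchar] $string] $splitchar]
}

# Split some text on empty lines, as in the original problem.
set text "first paragraph\nstill first\n\nsecond paragraph\n\nthird"
set parts [splitOnSubstring $text "\n\n"]
puts [llength $parts]   ;# prints 3
```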

JMN 2006-11-02:

My timings indicate the above method is about 10x faster than textutil::splitx.

However, Tcl's split alone on a single-char separator is 4x faster again.

I'd love to see a multi-character 'split' in the core. string split perhaps?
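
A comparison of this kind can be reproduced with Tcl's time command. The following is only a sketch: the test data, the repeat count, and the variable names are assumptions, and the ratios will vary with the input and the Tcl version. The splitx measurement is skipped if tcllib's textutil package is not installed.

```tcl
set sep "\n\n"
set placeholder \uFFFF
# Made-up test data: 10000 short paragraphs separated by empty lines.
set data [string repeat "some line of text\n\n" 10000]

# string map followed by split on the placeholder character
set mapTime [time {
    split [string map [list $sep $placeholder] $data] $placeholder
} 20]
puts "string map + split: $mapTime"

# tcllib's regexp-based splitter, only if it is available
if {![catch {package require textutil}]} {
    set splitxTime [time {textutil::splitx $data {\n\n}} 20]
    puts "textutil::splitx:   $splitxTime"
}
```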

Lars H, 2008-07-18: Thinking about that same idea, I believe the following syntax may be appropriate for a split extended that way:

    split text ?string list ...? ?chars?

The text is the string to split. Each string is a (possibly multicharacter) string at which it might be split, and when such a split occurs, the elements of the following list are appended to the result before processing resumes on the rest of the string. The chars are split-characters, as in current Tcl.

The idea of the list arguments is that sometimes you want to split on several substrings, but you also want to know which one it was at each position; current split doesn't provide that information. With the above you could do

  split {1+1-3+4-2} + plus - minus

to get

  1 plus 1 minus 3 plus 4 minus 2

and

  split {1+1-3+4-2} + {} - minus

to get

  1 1 minus 3 4 minus 2

However, one could probably use clever combinations of regexp -all, regsub, and/or string map to get this effect as well, so it's no great leap in expressive power.
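
As a rough sketch of how the string map route might look, the procedure below emulates the string/list pairs of the proposed syntax. The name splitLabelled is hypothetical, the trailing ?chars? argument is not handled, and \uFFFF is assumed to be absent from both the text and the labels. Each substring is replaced by the placeholder, followed by its labels, each terminated by the placeholder, so a single split then yields the pieces and the labels in order.

```tcl
# Hypothetical emulation of: split text ?string list ...?
proc splitLabelled {text args} {
    set sep \uFFFF
    set map {}
    foreach {substring labels} $args {
        # An empty label list leaves a plain split point; otherwise
        # every label becomes an element of its own in the result.
        set replacement $sep
        foreach label $labels {
            append replacement $label $sep
        }
        lappend map $substring $replacement
    }
    return [split [string map $map $text] $sep]
}

puts [splitLabelled {1+1-3+4-2} + plus - minus]   ;# 1 plus 1 minus 3 plus 4 minus 2
puts [splitLabelled {1+1-3+4-2} + {} - minus]     ;# 1 1 minus 3 4 minus 2
```

Note that string map tries the keys in the order given, so when one substring is a prefix of another, the longer one has to be listed first.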

splitstr

PYK 2016-02-16:

The following procedure, splitstr, splits a string on the regular expressions in $exprs. It returns a list in which each item is a two-item list: the text leading up to a match, followed by the text that matched. Concatenating all of these, in order, yields the original text.

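proc splitstr {text exprs} {
    # An empty text has nothing to split.
    if {$text eq {}} {
        return $text
    }
    # Combine the expressions into one alternation. They should not contain
    # capturing groups, which would shift the sub-matches consumed below.
    set exprs [join $exprs |]
    # Group 1: the longest run of characters at which no delimiter starts.
    # Group 2: the delimiter itself, or the end of the text.
    set regexp ((?:(?!$exprs|$).)*)($exprs|$)
    # x is the whole match, y the leading text, z the delimiter.
    lmap {x y z} [regexp -all -inline $regexp $text] {list $y $z}
}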
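
To show how such a result might be consumed, here is a small sketch that works on a hand-written list of pairs. The list has the shape described above and is what splitstr should produce for the text a,b;c split on the expressions , and ; (the variable names are made up for this example). The last pair has an empty second item because the text ends without a delimiter.

```tcl
# Pairs of the form {leading-text matched-text}
set pairs {{a ,} {b {;}} {c {}}}

# Keep only the fields, dropping the delimiters (lmap needs Tcl 8.6).
set fields [lmap pair $pairs {lindex $pair 0}]
puts $fields      ;# a b c

# Concatenating every item in order gives back the original text.
set original ""
foreach pair $pairs {
    lassign $pair before match
    append original $before $match
}
puts $original    ;# a,b;c
```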

Category String Processing

Category Parsing