[Arjen Markus] (10 August 2006) In the chatroom yesterday [Richard Suchenwirth] came up with an elegant solution for the following problem:

Given a string (actually the contents of a file) that contains empty lines (so \n\n), split it into parts on these empty lines.

You could use [regexp] to do this, but a much more elegant way (IMHO) is to replace the substring that you want to split on with a character that is not present in the original string and then use [split]. The choice of that character is of course a bit delicate, and the method is limited to ''fixed'' substrings. Still, the code is simple:

    set list [split [string map [list $substring $splitchar] $string] $splitchar]

Now, what character could you choose for $splitchar? One fascinating choice is '''\u0080''' - it is part of a region of the Unicode character map that is more or less forbidden or reserved. That makes it very unlikely to be present in the original string (unless that is a binary string of course, in which case most if not all bets are off, but splitting binary strings is a rare and dangerous thing anyway).

If you need to split on a substring that may vary (for instance a sequence of one or more empty lines), check out the [[split_re]] method in [Tcllib].

----

[WJP] (10 August 2006) \u0080 is fairly safe, but you can't be quite sure, since it is a legal Unicode control character. A better choice is \uFFFE or \uFFFF. Both are guaranteed not to be characters and so are absolutely safe.

----

[JMN] 2006-11-02

My timings indicate the above method is about 10x faster than [textutil]::[splitx]. However, Tcl's [split] alone on a single-character separator is 4x faster again. I'd love to see a multi-character 'split' in the core. [string split] perhaps?

[Lars H], 2008-07-18: Thinking about that same idea, I believe the following syntax may be appropriate for a [split] extended that way:

: '''split''' ''text'' ?''string'' ''list'' ...? ?''chars''?

The ''text'' is the string to split. Each ''string'' is a (possibly multicharacter) string at which it might be split, and when such a split occurs, the elements of the following ''list'' are appended to the result before processing resumes on the rest of the string. The ''chars'' are split-characters, as in current Tcl.

The idea of the ''list'' arguments is that sometimes you want to split on several substrings, but you also want to know which one it was at each position; current [split] doesn't provide that information. With the above you could do

    split {1+1-3+4-2} + plus - minus

to get

    1 plus 1 minus 3 plus 4 minus 2

or

    split {1+1-3+4-2} + {} - minus

to get

    1 1 minus 3 4 minus 2

However, one could probably use clever combinations of [regexp] -all, [regsub], and/or [string map] to get this effect as well, so it's no great leap in expressive power.

<<categories>> String Processing | Parsing
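
As a rough illustration of the ideas discussed on this page, here is a minimal sketch of the [string map] plus [split] idiom, followed by one possible way to emulate the ''string'' ''list'' behaviour that [Lars H] describes. The proc names `splitOnString` and `splitTagged` are hypothetical (they are not core or [Tcllib] commands), and `splitTagged` takes its ''string''/''list'' pairs as a single list argument rather than using the proposed syntax. Both assume that \uFFFF does not occur in the input and raise an error if it does; `splitTagged` also assumes Tcl 8.5 or later for `{*}`.

```tcl
# Split on a fixed multi-character substring by first mapping it to a
# single character that should never occur in real text (U+FFFF).
# splitOnString is a hypothetical helper name.
proc splitOnString {string substring} {
    set splitchar \uFFFF
    # The trick silently gives wrong results if the marker is already present.
    if {[string first $splitchar $string] >= 0} {
        error "input already contains the split character"
    }
    return [split [string map [list $substring $splitchar] $string] $splitchar]
}

# Emulate "split text string list ..." with string map: each separator is
# replaced by <marker><index><marker>, so that splitting on the marker
# yields alternating text pieces and separator indices.
# pairs is a flat list: string1 list1 string2 list2 ...
proc splitTagged {text pairs} {
    set sep \uFFFF
    if {[string first $sep $text] >= 0} {
        error "input already contains the split character"
    }
    set map {}
    set i 0
    foreach {str lst} $pairs {
        lappend map $str $sep$i$sep
        incr i
    }
    set result {}
    foreach {piece idx} [split [string map $map $text] $sep] {
        lappend result $piece
        # idx is empty after the last piece, which has no separator behind it.
        if {$idx ne ""} {
            lappend result {*}[lindex $pairs [expr {2 * $idx + 1}]]
        }
    }
    return $result
}

# Split file-like contents into paragraphs on empty lines.
set text "first paragraph\nstill first\n\nsecond paragraph\n\nthird"
foreach part [splitOnString $text "\n\n"] {
    puts "<$part>"
}

puts [splitTagged {1+1-3+4-2} {+ plus - minus}]
puts [splitTagged {1+1-3+4-2} {+ {} - minus}]
```

Note that `splitOnString` only handles the fixed separator it is given: three consecutive newlines would leave a stray newline at the start of the next part, which is the case where a regexp-based splitter is the better tool.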