Version 7 of Split On Whitespace

Updated 2018-06-07 09:58:38 by CecilWesterhof

Created by CecilWesterhof.

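Often I want to split a string on repeating white-space. The normal split command does not do what I want, because it treats every single white-space character as a separator, so a run of white-space produces empty elements. For example: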

split "   To   show    the   problem.   "

gives:

{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}

What I want is:

To show the problem.

That is why I created the following proc:

# A split that works on repeating white-space
# With:
#     splitOnWhiteSpace "   To   show    the   problem.   "
# You get:
#     "To show the problem."
# instead of:
#     "{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}"
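# With min/max you can verify the number of elements:
#     min only    -> exactly min elements are required
#     min and max -> between min and max elements are required
# The default of -1 means that the bound is not specified
proc splitOnWhiteSpace {value {min -1} {max -1}} {
    # Validate the optional bounds before doing any work
    if {!([string is integer -strict ${min}] && [string is integer -strict ${max}])} {
        error "min and max should both be integers (${min}, ${max})"
    }
    if {(${min} < -1) || (${max} < -1)} {
        error "min and max should both be >= -1 (${min}, ${max})"
    }
    if {(${max} != -1) && (${max} < ${min})} {
        error "min should be <= max (${min}, ${max})"
    }
    # Escape open braces and double quotes, so the value can be parsed as
    # a list; list parsing then collapses the repeating white-space
    set splitLst [list {*}[string map {
        \{ \\\{
        \" \\\"
    } ${value}]]
    # The number of elements is only checked when min is given
    if {${min} != -1} {
        if {${max} == -1} {
            # No max given: require exactly min elements
            set max ${min}
        }
        set length [llength ${splitLst}]
        if {(${length} < ${min}) || (${length} > ${max})} {
            if {${min} == ${max}} {
                set msgEnd "${min} values"
            } else {
                set msgEnd "between ${min} and ${max} values"
            }
            error "'${value}' contains ${length} instead of ${msgEnd}"
        }
    }
    return ${splitLst}
}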

With this I get:

To show the problem.

Besides splitting on repeating white-space, it can also check the number of elements and raise an error when the count is wrong. For example:

splitOnWhiteSpace "Just a test." 4

gives the error:

'Just a test.' contains 3 instead of 4 values

and:

splitOnWhiteSpace "Just a test." 4 5

gives the error:

'Just a test.' contains 3 instead of between 4 and 5 values

As always: comments, tips and questions are appreciated.


StephanKuhagen:

About four times faster than the regexp line:

list {*}[string map {\{ \\\{} $value]

The string map is needed to avoid unmatched open braces in lists. If you know that there will never be an opening brace in your inputs, you can leave out the string map and get it even faster.
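
To illustrate why the string map is needed, here is a small sketch. The input string is made up for the example, and the variable names failed, msg and splitLst are only illustrative. It shows that expanding the raw string fails, while the escaped string parses fine:

```tcl
set value "a {b c"

# Expanding the raw string as a list fails on the unmatched open brace
set failed [catch {list {*}${value}} msg]
puts "raw:     failed=${failed} (${msg})"

# After escaping the open brace, the string parses as three elements
set splitLst [list {*}[string map {\{ \\\{} ${value}]]
puts "escaped: [llength ${splitLst}] elements, second is [lindex ${splitLst} 1]"
```

The same reasoning applies to a double quote at the start of a word, which is probably why the proc above maps both characters.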

CecilWesterhof

Thanks, I implemented it. For the curious, originally I used:

set splitLst [regexp -all -inline {\S+} ${value}]
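
For those who want to compare the two variants themselves, a rough sketch follows. The iteration count of 10000 and the variable names are arbitrary choices for this example, and the timings will differ per machine and Tcl version. It also shows one difference I am aware of: list parsing performs backslash substitution, while the regexp leaves backslashes alone.

```tcl
set value "   To   show    the   problem.   "

# Both variants give the same four elements for plain text
set viaRegexp [regexp -all -inline {\S+} ${value}]
set viaList   [list {*}[string map {\{ \\\{ \" \\\"} ${value}]]
puts "same result: [expr {${viaRegexp} eq ${viaList}}]"

# Rough timing of both variants
set regexpTime [time {regexp -all -inline {\S+} ${value}} 10000]
set listTime   [time {list {*}[string map {\{ \\\{ \" \\\"} ${value}]} 10000]
puts "regexp: ${regexpTime}"
puts "list:   ${listTime}"

# The variants differ when the input contains a backslash
set withBackslash {a\tb c}
set regexpFirst [lindex [regexp -all -inline {\S+} ${withBackslash}] 0]
set listFirst   [lindex [list {*}[string map {\{ \\\{ \" \\\"} ${withBackslash}]] 0]
# The regexp keeps the four characters a, backslash, t and b
puts [string length ${regexpFirst}]
# List parsing turns backslash-t into a single tab character
puts [string length ${listFirst}]
```

If your input can contain backslashes that should be kept literally, the regexp variant may be the safer choice, at the cost of some speed.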