Additional string functions

Difference between version 101 and 102 - Previous - Next
This page describes various string processing that exist beside the standard commands

<<TOC>>

** Commands **

%|name|description|%
&|||&
&|[textutil%|%adjust%|%]||&
&|[textutil%|%tcllib::textutil::adjust%|%]||&
&|[textutil%|%tcllib::textutil::blank%|%]||&
&|[textutil%|%tcllib::textutil::cap%|%]||&
&|[textutil%|%tcllib::textutil::chop%|%]||&
&|[textutil%|%tcllib::textutil::getPredefined%|%]||&
&|[textutil%|%tcllib::textutil::indent%|%]||&
&|[textutil%|%tcllib::textutil::listPredefined%|%]||&
&|[textutil%|%tcllib::textutil::longestCommonPrefix%|%]||&
&|[textutil%|%tcllib::textutil::longestCommonPrefixList%|%]||&
&|[textutil%|%tcllib::textutil::readPatterns%|%]||&
&|[textutil%|%tcllib::textutil::repeat%|%]||&
&|[textutil%|%tcllib::textutil::splitn%|%]||&
&|[textutil%|%tcllib::textutil::splitx%|%]||&
&|[textutil%|%tcllib::textutil::tabify%|%]||&
&|[textutil%|%tcllib::textutil::tabify2%|%]||&
&|[textutil%|%tcllib::textutil::tail%|%]||&
&|[textutil%|%tcllib::textutil::trim%|%]||&
&|[textutil%|%tcllib::textutil::trimEmptyHeading%|%]||&
&|[textutil%|%tcllib::textutil::trimleft%|%]||&
&|[textutil%|%tcllib::textutil::trimprefix%|%]||&
&|[textutil%|%tcllib::textutil::trimrigth%|%]||&
&|[textutil%|%tcllib::textutil::uncap%|%]||&
&|[textutil%|%tcllib::textutil::undent%|%]||&
&|[textutil%|%tcllib::textutil::untabify%|%]||&
&|[textutil%|%tcllib::textutil::untabify2%|%]||&
&|[Pool (Kupries)%|%Pool%|%]|by [Andreas Kupries], has various string functions, some of which are probably now included in [textutil%|%tcllib::textutil%|%]|&&|[similarity%|%stringDistance%|%] (similarity)|measure the similarity between two strings|&
&|[finding the overlap in two strings%|%findOverlap]|if andhow many of the last lines in $str1 match the first lines in $str2|&
&|[Counting Characters in a string%|%count%|%]|count characters in a string]]|&
&|[case:title -Create WP formatted title strings on the active selection%|%case:title%|%]|Create WP formatted title strings on the active selection|&&|[Longest common subsequence]|The longest ordered list of substrings that can be derived from either string|&
&|[Longest common substring]||&
&|[ycl%|%ycl::string::cmp%|%]|Like the [Unix] [http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cmp.html%|%cmp] command, compares two strings and returns the index at which they differ, or -1 if they are identical.|&
&|[ycl%|%ycl::string::delimit%|%]|split (partition) a string into substrings using any combination of string, match, or regular expressions, returning both the substrings and the delimiters.|&
&|[ycl%|%ycl::string::pstring%|%]|converts non-printing characters in a string to their [Tcl Rules%|%hexadecimal] escape sequences|&
&|[ycl%|%ycl::string::shortmatch%|%]|like [[[string match]]], but returns the index of the last character of the shortest match, or -1|&
&|[ycl%|%ycl::tcl::armour%|%]|Convert a value into a printing string that can be pasted directly into a Tcl script as a value|&



** ASCII map **

No algorithm at all, but may come in handy ;-) 

''If you're on a UNIX box, try: man ascii''

======
proc ascii {} {return {
    00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
    08 bs   09 ht   0a nl   0b vt   0c np   0d cr   0e so   0f si
    10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
    18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
    20 sp   21  !   22  "   23  #   24  $   25  %   26  &   27  '
    28  (   29  )   2a  *   2b  +   2c  ,   2d  -   2e  .   2f  /
    30  0   31  1   32  2   33  3   34  4   35  5   36  6   37  7
    38  8   39  9   3a  :   3b  ;   3c  <   3d  =   3e  >   3f  ?
    40  @   41  A   42  B   43  C   44  D   45  E   46  F   47  G
    48  H   49  I   4a  J   4b  K   4c  L   4d  M   4e  N   4f  O
    50  P   51  Q   52  R   53  S   54  T   55  U   56  V   57  W
    58  X   59  Y   5a  Z   5b  [   5c  \   5d  ]   5e  ^   5f  _
    60  `   61  a   62  b   63  c   64  d   65  e   66  f   67  g
    68  h   69  i   6a  j   6b  k   6c  l   6d  m   6e  n   6f  o
    70  p   71  q   72  r   73  s   74  t   75  u   76  v   77  w
    78  x   79  y   7a  z   7b  {   7c  |   7d  }   7e  ~   7f del
}} ;#RS
======



** Linebreak **

'''Break a string into lines of specified maximum length:''' [join] the output of this with \n for hard-wrapped text: 
======
proc linebreak {s {width 80}} {
   set res {}
   while {[string length $s]>$width} {
       set     pos [string wordstart $s $width]
       lappend res [string range     $s 0    [expr $pos-1]]
       set     s   [string range     $s $pos end]
   }
   lappend res $s
} ;# RS
======

[Arjen Markus] An elegant solution, but two remarks:

   * The file word.tcl remarks that word boundaries are platform-dependent. Is this also used in [string wordstart]?
   * More importantly: The above fails if a word is longer than the given width! You get into an endless loop.

** word Wrap Via Tk Text Widget **

[D. McC]: A problem with [string wordstart], if you want hard-wrapped text that reliably looks right, is this (from the [string wordstart] page on the Wiki): "A word is considered to be any contiguous range of alphanumeric (Unicode letters or decimal digits) or underscore (Unicode connector punctuation) characters, ''or any single character other than these"'' (emphasis added). So, if you have a word (in the ordinary sense) preceded or followed by punctuation marks, [string wordstart] will treat the punctuation marks as separate "words," and they may not come out on the same line as the word they go with!

The Tk text widget, with word wrap on, displays wrapped text that reliably looks right, with no punctuation marks separated from the adjoining words. So, to get ''hard''-wrapped text that looks right, you can determine the locations where the text widget ends the wrapped lines, and replicate the text with newlines at those locations. Here's some code I wrote that does this (feel free to suggest improvements). A global variable "formawid" is set to hold the desired width in characters of the wrapped text, and the text widget (here, ".tx") is configured to that width, with word wrap:

======
.tx configure -width $formawid -wrap $wordwrap
wm geometry . {} ; # Make sure the toplevel shrinks or expands to fit
======

Then, after a delay of one-tenth of a second (which seems to be needed for some reason), the following procedure is run. The "whattodo" arg can be either "print" (to prepare hard-wrapped text for printing) or "show" (to display hard-wrapped text in the text widget). If there is a selection, only the selected text will be hard-wrapped; if there isn't, all text in the widget will be hard-wrapped. 

======
proc formanew {whattodo} {
    global formawid
    # Identify beginning and end of text to format:
    if {[.tx tag ranges sel] eq {}} {
        set selon 1.0
        set seloff [.tx index end]
    } else {
        set selon [.tx index sel.first]
        set seloff [.tx index sel.last]
    }
    set texin [expr int($selon)]
    set texend [expr int($seloff)]
    # Initialize variable to hold output:
    set formatext {} 
    for {set i $texin} {$i <= $texend} {incr i} {
        # Get text to newline:
        set endolin [.tx index $i.end]
        set endochar [lindex [split $endolin .] end]
        set whatline [.tx get $i.0 $endolin]
        # If line is blank, insert only newline into output:
        if {[string trim $whatline] eq {}} {
            append formatext \n
            continue
        }
        # If not, then find out where line is wrapped:
        for {set c 1} {$c <= $endochar} {incr c} {
            .tx see $i.$c
            set ceemin [expr {$c-1}]
            set boxie [.tx get $i.$ceemin]
            # Get y coordinates of bounding boxes for adjoining characters:
            set pixy [lindex [.tx bbox $i.$ceemin] 1]
            set nexy [lindex [.tx bbox $i.$c] 1]
            # If y coordinate of bounding box is greater than for
            # preceding character, line has been wrapped, so
            # insert preceding character plus newline into output:
            if {$nexy > $pixy} {
                append formatext $boxie\n
                .tx see $i.$c
            } else {
                # Otherwise, insert only the preceding character:
                append formatext $boxie
            }
        }
        # Replicate existing newline from text widget:
        if {$i < $texend} {
            append formatext \n
        }
    }
    if {$whattodo eq {print}} {
        return $formatext
    } else {
        .tx delete $selon $seloff
        .tx insert $selon $formatext
        .tx edit separator
    }
}

after 100 formanew
======

[hv] If you want to format the lines, look into textutil::adjust. It can do left-, right-, or center-justify, adjust the text width, and do hyphenation.



** Occurrence Count **

'''Count number of occurrences of a substring''' in a string:

======
proc scount {subs string} {
    regsub -all $subs $string $subs string
}
proc scount2 {subs string} {
    regexp -all $subs $string ;# 8.3
}
======

The latter can also be defined by currying (see [Custom curry]):

======
interp alias {} scount3 {} regexp -all
======

[JH] writes, in a comp.lang.tcl thread talking about this subject

======
proc scount4 {subs string} {
    regexp -all ***=$subs $string
} 
======

and [AM] initially proposes:

======
proc scount5 {subs string} {
    set count [llength [split [string map [list $subs \uFFFF]  $string] \uFFFF]]
    incr count -1 
}
======

but that doesn't make use of $string, so something is missing...

Stephane A. writes:

======
proc countstrings {data search} {
    set l [string length $search]
    set count 0
    while {[set i [string first $search $data]]>=0} {
        incr count
        incr i $l
        set data [string range $data $i end]
    }
    set count
} 
======



** Reverse a String **

======
proc sreverse s {join [lreverse [split $s {}]] {}} ;#RS
======

where ''lreverse'' is of course on [Additional list functions]...

----


[FW] notes: this goes through a transitional list form.  Still being a CPU speed and memory junkie, I feel obliged to correct this :P I'll call it srever''se'':
======
proc sreverse s {
    set res {} 
    set i [string length $s]
    while {$i >= 0} {
        set res "$res[string index $s [incr i -1]]"
    }
    return $res
}
======

[RS] agrees, except that `[append]` may be more efficient:

======
proc sreverse s {
    set l [string length $s]
    set res {} 
    while {$l} {append res [string index $s [incr l -1]]}
    set res
}
======

[LV] notes that two more solutions recently appeared on comp.lang.tcl:
The first is by [Michael Schlenker]:

======
proc string_reverse str {
    set rts {} 
    for {set i [string length $str]; incr i -1} {$i >= 0} {incr i -1} {
        append rts [string index $str $i]
    }
    return $rts
}
======

and the second by [(R. T. Wurth]:

======
proc srev s {
    return [join [::struct::list reverse [split $s {}]] {}]
}
======

[Am] 2005-12-13: The question came up again in the chatroom yesterday, and I decided to measure the performance of various alternatives ... See: [Performance of string reversing algorithms]



** Longest Common Prefix **

Also found in [textutil%|%tcllib::textutil]

[Arjen Markus] At times I have had the need (not too urgent though) of a function/proc to determine the position where two strings (or lists) become different:

   "Arjen Markus" versus "Arjen Marcus" --> position: 9
   "Arjen Markus" versus "Arjan Markus" --> position: 3

(though mostly these are cases where my name and the variations that I commonly encounter are not involved) - [RS]: That would be the length of the common prefix which one might implement like this:

======
proc commonPrefix {a b} {
    set res {}
    foreach i [split $a {}] j [split $b {}] {
        if {$i eq $j} {append res $i} else break
    }
    set res
}
======

Here is a generalized version that takes any number of strings and returns the prefix which all have in common:

======
proc longestCommonPrefix strings {
   set res {}
   set i 0
   foreach char [split [lindex $strings 0] {}] {
       foreach string [lrange $strings 1 end] {
           if {[string index $string $i] ne $char} {
               return $res
           }
       }
       append res $char
       incr i
   }
   set res
} ;# RS
======

[MGS] 2004-05-13: Here's another way - sort a list of strings and then compare the first and last:

======
proc string:common {string1 string2} {
    set i 1
    while {[string equal -length $i $string1 $string2]} {incr i}
    return [string range $string1 0 [expr {$i-2}]]
}

proc string:common:list {args} {
    if { [llength $args] < 2 } {
        return -code error "wrong # args: must be >= 2"
    }
    set list [lsort $args]
    return [string:common [lindex $list 0] [lindex $list end]]
}
======

[JEC] 2010-03-18, notes the above `string:common` has a bug and never returns on identical strings.  Even when fixed  it does O(n^2) comparisons

[BBH] 2004-05-13: This got me thinking that RE back references would be a perfect fit, a little playing & it turns out I was right ;)

======
proc prefix {s1 s2} {
    regexp {^(.*).*\0\1} "$s1\0$s2" all pref
    return $pref
}
======

and easily generalized for multiple words

======
proc prefix {str args} {
    set re {^(.*).*}
    foreach s $args {
        append re {\0\1.*}
        append str "\0$s"
    }
    regexp $re $str all pref
    return $pref
}
======

[NEM] 2005-03-17: Another RE variation:

======
proc reprefix {str1 str2} {
    regexp -inline ^[join [split $str1 {}] ?]? $str2
}
======

[NEM] 5mins later. 'Tis broken:

======none
% reprefix iamadonut iamatoilet
iamat
======

Whoops.



** Subscripts **

The characters for which subscript versions exist in Unicode (digits, parens, some operators -see the list in code) are converted to their subscripted Unicodes. Others are left unchanged. (For superscripts the positions 1,2,3 seem not to be filled in my installation, so I didn't add the corresponding code...)

======
proc subscript s {
    set res {} 
    foreach char [split $s {}] {
        set pos [lsearch -exact {0 1 2 3 4 5 6 7 8 9 + - = ( )} $char]
        if {$pos>=0} {set char [format %c [incr pos 0x2080]]}
        append res $char
    }
    set res
} ;# RS
% puts H[subscript 2]O
H?O
======



** Strictly a Floating Point Number **

the regular `[string] is double` term fires on integers too. The following fires on real floats only:

======
proc isFloat x {expr {[string is double -strict $x] && ![string is int $x]}} ;#RS
======



** Sort a String **

[AM]: For some symbolic manipulations (w.r.t. group theory, oh just for the fun of it), I need to sort the characters in a string, so that for instance `ababa` becomes `aaabb`.

This can be done via:

======
proc charsort {string}  {
   return [join [lsort [split $string {}]] {}]
}
======



** Longest String **

Here's a little proc to find the length of the longest string in a list:

======
proc maxlen args {
    # Written 2003 by Ed Suominen, hereby placed in the public domain
    if {[llength $args] > 1} {
        if {
            [set x [string length [lindex $args 0]]] -
            [set y [string length [lindex $args 1]]] < 0
        } {
            return -1
        } elseif {$x == $y} {
            return 0
        } else {
            return 1
        }
    } else {
        set x [lsort -decreasing -command maxlen [lindex $args 0]]
        return [string length [lindex $x 0]]
    }
}
======

Alternative:

======
proc maxlen args {
    if {[llength $args] == 1} {
        set args [lindex $args 0]
    }
    set res [string length [lindex $args 0]]
    foreach i [lrange $args 1 end] {
        set l2 [string length $i]
        if {$l2 > $res} {set res $l2}
    }
    set res
} ;# RS
======
[KPV] one liner using lmap:
======
set words {now is the time for all good men to come to the aid of their country}
set how_long [::tcl::mathfunc::max {*}[lmap v $words {string length $v}]]
set longest_word [lindex [lsort -index 1 -integer [lmap v $words { list $v [string length $v] }]] end 0]
======




** String Insert **

'''Insert a string into another''' at a given position, analogous to [linsert]:

======
proc strinsert {string pos char} {
    set original [string index $string $pos]
    string replace $string $pos $pos $char$original
} ;# RS
% strinsert hello 1 x
hxello
======

[AMG]: Why doesn't this come standard in the `[string]` command?

Also, don't let the formal parameter name "char" fool you.  With this code, the string to be inserted can be any length.  I'd write it this way:

======
proc strinsert {string pos insertion} {
    append insertion [string index $string $pos]
    string replace $string $pos $pos $insertion
}
namespace ensemble configure string -map\
    [dict replace [namespace ensemble configure string -map] insert strinsert]
% string insert hello 1 xxx
hxxxello
======

Or, even faster:

======
proc strinsert {string pos insertion} {
    string replace $string $pos $pos $insertion[string index $string $pos]
}
======

[AMG]: See [TIP] 475 [http://tip.tcl.tk/475] for a new [[[string insert]]] command.

** N-th Occurrence of a Substring **

[Silas] 2005-10-14:  If you want to find the third or fourth occurrence of a string in another string, you'll have to use [string first] many times. The following proc could help:

======
proc mystringFirst {substring mystring ocorrencia} {
    if {!$ocorrencia} {return -1}
    set index 0

    for {set i 0} {$i < $ocorrencia} {incr i} {
        set index [string first $substring $mystring]
        set mystring [string range $mystring [expr $index + 1] [string length $mystring]]
    }

    return $index
}
======

[MG] 2017-08-17 This doesn't seem to work correctly (or at least gives very unexpected results for me):
======
% set str "foo bar baz boing sprocket banana"
% mystringFirst ba $str 1
4
% mystringFirst ba $str 2
3
======
A couple of alternatives that give the results I would expect:
======
proc mystringFirst2 {needle haystack count} {
        set pos -1
        while { $count && [set pos [string first $needle $haystack $pos+1]] > -1 } {
                incr count -1
        }
        return $pos
}

proc mystringFirst3 {needle haystack count} {
        set indices [regexp -all -inline -indices $needle $haystack]
        if { [llength $indices] >= $count } {
                return [lindex $indices $count-1 0]
        } else {
                return -1
        }
}
======
mystringFirst3 obviously takes a regexp needle, rather than a plain string, but if you add
======
  set needle [regsub -all {([^a-zA-Z0-9 ])} $needle {\\\1}]
======
to the start it'll work on a plain string instead.



** String to Proper English Title **

[D. McC] 2005-10-20: The "string totitle" subcommand only, and always, capitalizes the first letter in a string, no matter how many words are in the string or what the words are. Here's some code to convert a multiple-word expression in English to a properly capitalized title, in which all initial letters are capitalized except those of articles, conjunctions, and prepositions with four or fewer letters. '''(Needs improvement--see revised version, farther below)'''

======
proc title str {
    set output {} 
    set nocaps [list a an and at but by for from in into of on or the to with]
    foreach word [split $str] {
        if {[lsearch $nocaps $word] == -1} {
            set word [string totitle $word]
        }
        append output "$word "
    }
    return [string trim $output]
}
======

Example:

======none
% set bogomips "groundwork of the metaphysics of balderdash"
groundwork of the metaphysics of balderdash
% title $bogomips
Groundwork of the Metaphysics of Balderdash
======

[DKF]: Cool! It's not quite right though. The following gets closer, but isn't right yet either (testing on your paragraph above, of course!)

======
proc title {str} {
    set output {} 
    set nocaps {a an and at but by for from in into of on or the to with}
    foreach word [regexp -all -inline {[\w'.]+|\W+} $str] {
        if {[string match {[A-Za-z]*} $word] && [lsearch $nocaps $word] == -1} {
            # In 8.5 should use the 'ni' operator instead of the [lsearch]
            set word [string totitle $word]
        }
        append output $word
    }
    return [string trim $output]
}
======

[LES]: You forget that you'll want those "nocaps" words to be turned to title if they begin a sentence. Here is a proc I use to rename files (usually mp3). It doesn't handle punctuation, but handles hyphens, e.g.: "Artist - A Song Title". Tweaking it to also take punctuation (periods or first word of title) in consideration should be trivial. That is not desired in file renaming because periods (dots) do not mean the same as in regular titles.

======
proc mp3 args {
    set _nocaps {
        a as à às ao aos de da das do dos e em na nas no nos 
        o os ou para por que sem sob
        an and are at but for from if in is it's not of on or 
        the to under vs vs. with without
        au aux avec dans des en et le la les ou par pour qui si
        con del el en la las los sin y
    }; # German and Italian, anyone?
 
    foreach _file [glob *] { 
        if { $_file eq {.:} } { continue }
        # lowercase the whole name
        set _old  $_file
        set _file [string tolower $_file]
        set _new {}              ;# empty _new in each iteration
        set _c   0               ;# zero counter in each iteration
 
        foreach _word $_file { 
            incr _c
 
            # anything right after " - " probably is the first word of a phrase.
            # if current word is "-", reset counter so next word is 1 and gets caps
            if {$_word eq {-}} {set _c 0}

            # if current word is 1, it gets caps
            if {$_c == 1} {set _word [string totitle $_word]}

            # if it is not in exceptions, it gets caps
            if {[lsearch $_nocaps $k ] < 0} { 
               set k [string totitle $k]
            }

            # add the word to the new name
            lappend _new $k
        }
 
        # UNCOMMENT the two next lines if you want files renamed automatically
        #file rename  $_old  RENAME_TEMP
        #file rename  RENAME_TEMP  $_new
        puts  "$_old\n$_new\n"
    }
}
======

[D. McC]:  OK, I do want the "nocaps" words capitalized if they (1) begin the title or (2) come right after a colon. Also, as I belatedly noticed, quotation marks need to be stripped out for "string totitle" to work right, but then put back into the finished product. I'd go a long way to avoid a regular expression like {`[[\w'.]]+|\W+`}, though. Let's try this:

======
proc title str {
    set output {} 
    set nocaps [list a an and at but by for from in into of on or the to with]
    set count 0
    foreach word [split $str] {
        # Strip quotation marks:
        if {[string index $word 0] eq "\""} {
            set quote 1
            set word [string trim $word \"]
        } else {
            set quote 0
        }
        # Always capitalize the first word; otherwise,
        # don't capitalize any words in the "nocaps" list:
        if {$count == 0 || [lsearch $nocaps $word] == -1} {
            set word [string totitle $word]
        }
        # Add word plus space, with or without quotation marks, to output:
        if {$quote} {
            append output "\"$word\" "
        } else {
            append output "$word "
        }
        # Capitalize any word after a colon:
        if {[string index $word end] eq {:}} {
            set count 0
        } else {
            incr count
        }
    }
    return [string trim $output]
}
======

Example:

======none
% set wordsworth {what I say is: by gum, give me the finest "bogomips" in the universe!}
what I say is: by gum, give me the finest "bogomips" in the universe!
% title $wordsworth
What I Say Is: By Gum, Give Me the Finest "Bogomips" in the Universe!
======

======
proc mystringFirst {substring mystring ocorrencia} {
    if {!$ocorrencia} {return -1}
    set index 0
  
    for {set i 0} {$i < $ocorrencia} {incr i} {
        set index [string first $substring $mystring]
        set mystring [string range $mystring [expr $index + 1] [string length $mystring]]
    }
  
    return $index
}
======

[pw]: Here's an example that doesn't use foreach.

======
proc title-case text {
    for {set index 0} {$index < [string length $text]} {incr index} {
        if { [string wordstart $text $index] == $index } {
            set text [string replace $text $index $index [string toupper [string index $text $index]]]
        } else {
            set text [string replace $text $index $index [string tolower [string index $text $index]]]
        }
    }
    return $text
}
======



** Misc **

LV: Take a look at [Tcl-FAQ]'s part 5 to gather ideas on other
packages with string functions.



** Letterspacing **

[escargo]:  

spreading a string by inserting blanks between each two characters. [KBK] notes that there's a printer's proverb: Anyone who would l e t t e r s p a c e    l o w e r c a s e   would steal sheep. Simply functional:

======
proc letterspace s {join [split $s {}] { }} ;# RS
% letterspace {steal sheep}
s t e a l   s h e e p
======

In the book ''Stop Stealing Sheep & find out how type works,'' by Erik Spiekermann and E.M. Ginger,
Adobe Press, 1993, on page 7, in the side bar the quotation attributed to Frederic Goudy was,
"Anyone who would letterspace black letter would steal sheep."



** Empty string test **

See: [is_empty]

Summary: the fastest way is `[[[expr] {![[[string length] $str]]]]`.


<<categories>> Category Discussion | Category Example | Category String Processing