Purpose: use this page to point to extensions with useful unique string functions, as well as for describing additional code or APIs for string functions that are missing. ---- Andreas Kupries has some nice string functionality in his Pool package that should be examined when a standard string package enters the design phase. It can be found at http://www.oche.de/~akupries/soft/pool/index.htm ---- LV: Take a look at [Tcl-FAQ]'s part 5 to gather ideas on other packages with string functions. ---- Items listed on this page without specific urls probably fall into the 'wishlist' category. ---- '''Break a string into lines of specified maximum length:''' [join] the output of this with \n for hard-wrapped text: proc linebreak {s {width 80}} { set res {} while {[string length $s]>$width} { set pos [string wordstart $s $width] lappend res [string range $s 0 [expr $pos-1]] set s [string range $s $pos end] } lappend res $s } ;# RS [Arjen Markus] An elegant solution, but two remarks: * The file word.tcl remarks that word boundaries are platform-dependent. Is this also used in [string wordstart]? * More importantly: The above fails if a word is longer than the given width! You get into an endless loop. [D. McC]: A problem with [string wordstart], if you want hard-wrapped text that reliably looks right, is this (from the [string wordstart] page on the Wiki): "A word is considered to be any contiguous range of alphanumeric (Unicode letters or decimal digits) or underscore (Unicode connector punctuation) characters, ''or any single character other than these"'' (emphasis added). So, if you have a word (in the ordinary sense) preceded or followed by punctuation marks, [string wordstart] will treat the punctuation marks as separate "words," and they may not come out on the same line as the word they go with! The Tk text widget, with word wrap on, displays wrapped text that reliably looks right, with no punctuation marks separated from the adjoining words. So, to get ''hard''-wrapped text that looks right, you can determine the locations where the text widget ends the wrapped lines, and replicate the text with newlines at those locations. Here's some code I wrote that does this (feel free to suggest improvements). A global variable "formawid" is set to hold the desired width in characters of the wrapped text, and the text widget (here, ".tx") is configured to that width, with word wrap: .tx configure -width $formawid -wrap $wordwrap wm geometry . {} ; # Make sure the toplevel shrinks or expands to fit Then, after a delay of one-tenth of a second (which seems to be needed for some reason), the following procedure is run. The "whattodo" arg can be either "print" (to prepare hard-wrapped text for printing) or "show" (to display hard-wrapped text in the text widget). If there is a selection, only the selected text will be hard-wrapped; if there isn't, all text in the widget will be hard-wrapped. proc formanew {whattodo} { global formawid # Identify beginning and end of text to format: if {[.tx tag ranges sel] eq ""} { set selon 1.0 set seloff [.tx index end] } else { set selon [.tx index sel.first] set seloff [.tx index sel.last] } set texin [expr int($selon)] set texend [expr int($seloff)] # Initialize variable to hold output: set formatext "" for {set i $texin} {$i <= $texend} {incr i} { # Get text to newline: set endolin [.tx index $i.end] set endochar [lindex [split $endolin "."] end] set whatline [.tx get $i.0 $endolin] # If line is blank, insert only newline into output: if {[string trim $whatline] eq ""} { append formatext "\n" continue } # If not, then find out where line is wrapped: for {set c 1} {$c <= $endochar} {incr c} { .tx see $i.$c set ceemin [expr {$c-1}] set boxie [.tx get $i.$ceemin] # Get y coordinates of bounding boxes for adjoining characters: set pixy [lindex [.tx bbox $i.$ceemin] 1] set nexy [lindex [.tx bbox $i.$c] 1] # If y coordinate of bounding box is greater than for # preceding character, line has been wrapped, so # insert preceding character plus newline into output: if {$nexy > $pixy} { append formatext $boxie\n .tx see $i.$c } else { # Otherwise, insert only the preceding character: append formatext $boxie } } # Replicate existing newline from text widget: if {$i < $texend} { append formatext "\n" } } if {$whattodo eq "print"} { return $formatext } else { .tx delete $selon $seloff .tx insert $selon $formatext .tx edit separator } } after 100 formanew ---- '''Count number of occurrences of a substring''' in a string: proc scount {subs string} {regsub -all $subs $string $subs string} proc scount2 {subs string} {regexp -all $subs $string ;# 8.3} The latter can also be defined by currying (see [Custom curry]): interp alias {} scount3 {} regexp -all ---- '''Revert a string''' (i.e. if given string "abc" return string "cba": proc srevert {s} {join [lreverse [split $s ""]] ""} ;#RS where ''lrevert'' is of course on [Additional list functions]... ''maybe it used to be, but I don't see it there now.'' [RS]: Oops, it's named ''lreverse'' there - sorry! [FW] notes: this goes through a transitional list form. Still being a CPU speed and memory junkie, I feel obliged to correct this :P I'll call it srever''se'': proc sreverse {s} { set res "" set i [string length $s] while {$i >= 0} { set res "$res[string index $s [incr i -1]]" } return $res } [RS] agrees, except that [append] may be more efficient: proc srevert s { set l [string length $s] set res "" while {$l} {append res [string index $s [incr l -1]]} set res } [LV] notes that two more solutions recently appeared on comp.lang.tcl: The first is by [Michael Schlenker]: proc string_reverse {str} { set rts "" for {set i [string length $str]; incr i -1} {$i >= 0} {incr i -1} { append rts [string index $str $i] } return $rts } and the second by [(R. T. Wurth]: proc srev s { return [join [::struct::list reverse [split $s {}]] {}] } ---- '''[Similarity] of two strings''' can be had there ---- [Arjen Markus] At times I have had the need (not too urgent though) of a function/proc to determine the position where two strings (or lists) become different: "Arjen Markus" versus "Arjen Marcus" --> position: 9 "Arjen Markus" versus "Arjan Markus" --> position: 3 (though mostly these are cases where my name and the variations that I commonly encounter are not involved) - [RS]: That would be the length of the common prefix which one might implement like this: proc commonPrefix {a b} { set res {} foreach i [split $a ""] j [split $b ""] { if {$i==$j} {append res $i} else break } set res } Here is a generalized version that takes any number of strings and returns the prefix which all have in common: proc longestCommonPrefix strings { set res {} set i 0 foreach char [split [lindex $strings 0] ""] { foreach string [lrange $strings 1 end] { if {[string index $string $i] != $char} { return $res } } append res $char incr i } set res } ;# RS ''[MGS] [[2004/05/13]]'' - Here's another way - sort a list of strings and then compare the first and last: proc string:common {string1 string2} { set i 1 while { [string equal -length $i $string1 $string2] } { incr i } return [string range $string1 0 [expr {$i-2}]] } proc string:common:list {args} { if { [llength $args] < 2 } { return -code error "wrong # args: must be >= 2" } set list [lsort $args] return [string:common [lindex $list 0] [lindex $list end]] } ''[BBH] [[2004/05/13]]'' - This got me thinking that RE back references would be a perfect fit, a little playing & it turns out I was right ;) proc prefix {s1 s2} { regexp {^(.*).*\0\1} "$s1\0$s2" all pref return $pref } and easily generalized for multiple words proc prefix {str args} { set re {^(.*).*} foreach s $args { append re {\0\1.*} append str "\0$s" } regexp $re $str all pref return $pref } [NEM] 17Mar2005 - Another RE variation: proc reprefix {str1 str2} { regexp -inline ^[join [split $str1 ""] ?]? $str2 } [NEM] 5mins later. 'Tis broken: % reprefix iamadonut iamatoilet iamat Whoops. ---- '''Subscripts:''' The characters for which subscript versions exist in Unicode (digits, parens, some operators -see the list in code) are converted to their subscripted Unicodes. Others are left unchanged. (For superscripts the positions 1,2,3 seem not to be filled in my installation, so I didn't add the corresponding code...) proc subscript s { set res "" foreach char [split $s ""] { set pos [lsearch -exact {0 1 2 3 4 5 6 7 8 9 + - = ( )} $char] if {$pos>=0} {set char [format %c [incr pos 0x2080]]} append res $char } set res } ;# RS % puts H[subscript 2]O H?O ---- '''String is float''': the regular ''[string] is double'' term fires on integers too. The following fires on real floats only: proc isFloat x {expr {[string is double -strict $x] && ![string is int $x]}} ;#RS ---- '''Letterspacing:''' spreading a string by inserting blanks between each two characters. [KBK] notes that there's a printer's proverb: Anyone who would l e t t e r s p a c e l o w e r c a s e would steal sheep. Simply functional: proc letterspace s {join [split $s ""] " "} ;# RS % letterspace "steal sheep" s t e a l s h e e p ---- In the book ''Stop Stealing Sheep & find out how type works,'' by Erik Spiekermann and E.M. Ginger, Adobe Press, 1993, on page 7, in the side bar the quotation attributed to Frederic Goudy was, "Anyone who would letterspace black letter would steal sheep." -- [escargo] ---- [AM] For some symbolic manipulations (w.r.t. group theory, oh just for the fun of it), I need to sort the characters in a string, so that for instance "ababa" becomes "aaabb". This can be done via: proc charsort { string } { return [join [lsort [split $string {}] ] {} ] } ---- Here's a little proc to find the length of the longest string in a list: proc maxlen {args} { # Written 2003 by Ed Suominen, hereby placed in the public domain if { [llength $args] > 1 } { if { [set x [string length [lindex $args 0]]] - [set y [string length [lindex $args 1]]] < 0 } { return -1 } elseif { $x == $y } { return 0 } else { return 1 } } else { set x [lsort -decreasing -command maxlen [lindex $args 0]] return [string length [lindex $x 0]] } } # Alternative: proc maxlen args { if {[llength $args]==1} {set args [lindex $args 0]} set res [string length [lindex $args 0]] foreach i [lrange $args 1 end] { set l2 [string length $i] if {$l2 > $res} {set res $l2} } set res } ;# RS ---- '''Insert a string into another''' at a given position, analogous to [linsert]: proc strinsert {string pos char} { set original [string index $string $pos] string replace $string $pos $pos $char$original } ;# RS % strinsert hello 1 x hxello ---- [finding the overlap in two strings] | [string] - [Arts and crafts of Tcl-Tk programming] ---- [RHS] ''23Feb2005'' There have been a number of times where I have wanted a version of '''subst''' that preserves grouping. For example, I want to be able to do the following: % array set myarr [subst { a $x b [getConfigValue something] c { a b c d } }] % array get myArr a 1 b blah c { a b c d } My main reason for this is, when the dataset get large, I tend to find \ line continuations fairly ugly... That, and my emacs config doesn't indent well for multiple levels of line continuations if there's sub levels. In pursuit of the above, I came up with the following: proc gsubst {input} { set data {} foreach line [split $input \n] { if { [info complete $data] } { append data " $line" } else { append data "\n$line" } } uplevel 1 list $data } The above proc should, in theory, cause the following two to work exactly the same list a 1 \ b $x \ c { 1 2 } \ d [somecommand] gsubst { a 1 b $x c { 1 2 } d [somecommand] } ---- [[What should the category be for this?]] [RS]: Hm.. "Additional"? That's the criterion I quickly search this set of pages with - such takeaway snippets used to be in the [Bag of algorithms] until it grew too thick. ---- [Silas] - 2005.10.14 - If you want to find the third or fourth ocurrance of a string in another string, you'll have to use [string first] many times. The following proc could help: proc mystringFirst {substring mystring ocorrencia} { if {!$ocorrencia} {return -1} set index 0 for {set i 0} {$i < $ocorrencia} {incr i} { set index [string first $substring $mystring] set mystring [string range $mystring [expr $index + 1] [string length $mystring]] } return $index } ---- '''String to proper English title''' [D. McC]: Oct 20 2005 - The "string totitle" subcommand only, and always, capitalizes the first letter in a string, no matter how many words are in the string or what the words are. Here's some code to convert a multiple-word expression in English to a properly capitalized title, in which all initial letters are capitalized except those of articles, conjunctions, and prepositions with four or fewer letters. (Let me know if I forgot any articles, conjunctions, or whatnot.) proc title {str} { set output "" set nocaps [list a an and at but by for from in into of on or the to with] foreach word [split $str] { if {[lsearch $nocaps $word] == -1} { set word [string totitle $word] } append output "$word " } return [string trim $output] } Example: % set bogomips "groundwork of the metaphysics of balderdash" groundwork of the metaphysics of balderdash % title $bogomips Groundwork of the Metaphysics of Balderdash ---- [Finding the overlap in two strings] [Counting characters in a string] ---- [Category Discussion] [Category Example]