Recently a discussion arose concerning a way of determining how many occurrences of a character appeared in a string. Several methods were proposed: * [regexp] * [split] * [foreach]/[split]/[incr] * [string map] Isn't this page a good reason why 'string count' should be added to Tcl? - [RS] 2007-1013: This wiki has some 20,000 pages, many with one or more useful algorithms. But I wouldn't want them all drawn into the Tcl core... ([Tcllib] maybe, although that is already too crowded in some parts). If you can implement a (sometimes) useful function in one line of code, just do it locally :^) - but let's keep Tcl [simple]! #! /usr/tcl84/bin/tclsh set mystring "line1\nline2\nline3\nline4\n" proc splitline {string} { set rc [llength [split $string "\n"]] incr rc -1 return $rc } proc regline {string} { set rc [regexp -line -all "\n" $string] return $rc } proc splitchar {string} { foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}} set rc $counters(\n) return $rc } puts [time {splitline $mystring} 10000] puts [time {regline $mystring} 10000] puts [time {splitchar $mystring} 10000] On my SPARC Solaris, using Tcl 8.4, I get 49 microseconds per iteration 154 microseconds per iteration 963 microseconds per iteration as the result. - [RS]: Surprising, but good to know, that [split] is by far the fastest way... [Brett Schwarz] added this proc as well: proc strmap {str} { return [expr {[string length $str]-[string length [string map {"\n" ""} $str]]}] }; on my SuSE Linux 7.3 machine (2X333MHz PII) with Tcl 8.4, I get these results: 34 microseconds per iteration 87 microseconds per iteration 782 microseconds per iteration 34 microseconds per iteration So, it gives the same performance of the split version. However, if the string length is increased, to, let's say, 100 lineX's (i.e. 100 \n), then this proc using [string map] actually scales better. Here are the numbers with 100: 263 microseconds per iteration 1729 microseconds per iteration 2573 microseconds per iteration 141 microseconds per iteration There is quite an improvement.... However, there seems to be a difference in interpreters used. Here is the same timing with 8.3.4.4: 288 microseconds per iteration 1775 microseconds per iteration 8514 microseconds per iteration 578 microseconds per iteration Appears string map got byte compiled in 8.4 (according to [Miguel Sofer]) [RS] contributes maybe not the fastest, but for now the shortest variation: interp alias {} countCharsA {} regexp -all to be called with the wanted character and the string as additional arguments [Brett Schwarz] Some people (please correct me, if I got the wrong person) added some other procs, and here they are with some new times: # [Michael A. Cleverly] proc splitline2 {string} { return [expr {[llength [split $string {\n}]] - 1}] } ''Shouldn't that be'' ... [split $string "\n"] ... ''? As it stands it will split on backslashes and n's, but not on newlines. And the [return] is of course unnecessary.'' [FW]: Yeah, but return is slightly faster. ''[Lars H]: Is it? That's interesting! But '''why''' is it faster?'' [FW]: (belated response) I think it's because when a call is nested into a return now, the interpreter knows that this is the last command executed in the procedure and so it can clean up the stack earlier, or something like that. Pre-8.1 return was actually slower, because it wasn't a special compiled construct. # [Michael A. Cleverly] proc regsubline {string} { return [regsub -all -- {\n} $string {\n} string] } # [Miguel Sofer] proc splitchar2 string { set rc 0 foreach x [split $string ""] { if {$x == "\n"} { incr rc } } return $rc } # [Michael A. Cleverly] proc strfirst {string} { set rc [set ndx 0] while {[set ndx [expr { [string first \n $string $ndx] + 1}]]} { incr rc } return $rc } puts "splitline [time {splitline $mystring} 10000]" puts "splitline2 [time {splitline2 $mystring} 10000]" puts "regline [time {regline $mystring} 10000]" puts "regsubline [time {regsubline $mystring} 10000]" puts "splitchar [time {splitchar $mystring} 10000]" puts "splitchar2 [time {splitchar2 $mystring} 10000]" puts "strmap [time {strmap $mystring} 10000]" puts "countCharA [time {countCharsA "\n" $mystring} 10000]" puts "strfirst [time {strfirst $mystring} 10000]" Tcl 8.4.0.1 ----------- splitline 265 microseconds per iteration splitline2 307 microseconds per iteration regline 1798 microseconds per iteration regsubline 1996 microseconds per iteration splitchar 2672 microseconds per iteration splitchar2 2872 microseconds per iteration strmap 141 microseconds per iteration countCharA 1781 microseconds per iteration strfirst 602 microseconds per iteration On Win2k splitline 160 microseconds per iteration splitline2 192 microseconds per iteration regline 1310 microseconds per iteration regsubline 1410 microseconds per iteration splitchar 1461 microseconds per iteration splitchar2 1692 microseconds per iteration strmap 55 microseconds per iteration countCharA 1305 microseconds per iteration strfirst 295 microseconds per iteration Tcl 8.3.4.4 ----------- splitline 284 microseconds per iteration splitline2 324 microseconds per iteration regline 1814 microseconds per iteration regsubline 2082 microseconds per iteration splitchar 8723 microseconds per iteration splitchar2 3204 microseconds per iteration strmap 597 microseconds per iteration countCharA 1810 microseconds per iteration strfirst 4185 microseconds per iteration On Win2k splitline 163 microseconds per iteration splitline2 190 microseconds per iteration regline 1324 microseconds per iteration regsubline 1434 microseconds per iteration splitchar 3786 microseconds per iteration splitchar2 1969 microseconds per iteration strmap 242 microseconds per iteration countCharA 1342 microseconds per iteration strfirst 3027 microseconds per iteration As an aside, this illustrates the speed improvements in Tcl 8.4 I went ahead and added 200 more \n to the string (string repeat line1\n\n\n)...on a suggestion from [David Welton]. Some of the procs do not scale well at all. Here are the results on Linux: /usr/local/ActiveTcl8.4.0.1/bin/tclsh splitline 403 microseconds per iteration splitline2 349 microseconds per iteration regline 4894 microseconds per iteration regsubline 5254 microseconds per iteration splitchar 3280 microseconds per iteration splitchar2 3701 microseconds per iteration strmap 168 microseconds per iteration countCharA 4919 microseconds per iteration strfirst 1534 microseconds per iteration /usr/local/ActiveTcl8.3.4.4/bin/tclsh splitline 430 microseconds per iteration splitline2 354 microseconds per iteration regline 5039 microseconds per iteration regsubline 5495 microseconds per iteration splitchar 11297 microseconds per iteration splitchar2 4263 microseconds per iteration strmap 688 microseconds per iteration countCharA 4869 microseconds per iteration strfirst 14578 microseconds per iteration ---- [Stu] 2007-10-13 I've taken the above procs and tried to make them as similar as possible and then run some tests with various lengths of data. Some of the procs only count characters, others can count strings. Some procs have been adjusted for this. Note the difference in speed between splitchar2 and splitchar2a simply by using eq instead of ==. The overall winner is strmap. Also shown in the output is the result of each proc; they should be all the same. If I've gotten your name wrong of you feel I misrepresented you or your proc somehow please feel free to make corrections. Example output from my machine: 1 char search, 240000 chars string splitchar 238632 microseconds per iteration (40000) splitchar2 305889 microseconds per iteration (40000) splitchar2a 167124 microseconds per iteration (40000) strmapC 19843 microseconds per iteration (40000) splitline 45872 microseconds per iteration (40000) splitline2 42932 microseconds per iteration (40000) regline 192236 microseconds per iteration (40000) regsubline 32450 microseconds per iteration (40000) strmapS 19131 microseconds per iteration (40000) countCharA 185561 microseconds per iteration (40000) strfirst 61804 microseconds per iteration (40000) ---- 100 chars search, 11900 chars string regline 67589 microseconds per iteration (100) regsubline 544 microseconds per iteration (100) strmapS 138 microseconds per iteration (100) countCharA 64383 microseconds per iteration (100) strfirst 458 microseconds per iteration (100) The thing: #! /bin/sh # \ exec tclsh "$0" ${1+"$@"} proc splitline {string countChar} { set rc [llength [split $string $countChar]] incr rc -1 return $rc } proc regline {string countString} { set rc [regexp -line -all $countString $string] return $rc } proc splitchar {string countChar} { foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}} set rc $counters($countChar) return $rc } # Brett Schwarz (split into char and string versions) proc strmapC {string countChar} { return [expr {[string length $string]-[string length [string map [list $countChar ""] $strin$ } proc strmapS {string countString} { return [expr {([string length $string]-[string length [string map [list $countString ""] $st$ } # Richard Suchenwirth interp alias {} countCharsA {} regexp -all # Michael A. Cleverly proc regsubline {string countString} { return [regsub -all -- $countString $string $countString string] } # Miguel Sofer proc splitchar2 {string countChar} { set rc 0 foreach x [split $string ""] { if {$x == $countChar} { incr rc } } return $rc } # Miguel Sofer (== changed to eq) proc splitchar2a {string countChar} { set rc 0 foreach x [split $string ""] { if {$x eq $countChar} { incr rc } } return $rc } # Michael A. Cleverly proc strfirst {string countString} { set rc [set ndx 0] while {[set ndx [expr { [string first $countString $string $ndx] + 1}]]} { incr rc } return $rc } proc go {string countString times} { if {[string length $countString] < 2} { puts "splitchar [time {splitchar $string $countString} $times] ([splitchar $string$ puts "splitchar2 [time {splitchar2 $string $countString} $times] ([splitchar2 $stri$ puts "splitchar2a [time {splitchar2a $string $countString} $times] ([splitchar2a $st$ puts "strmapC [time {strmapC $string $countString} $times] ([strmapC $string $co$ puts "splitline [time {splitline $string $countString} $times] ([splitline $string$ puts "splitline2 [time {splitline2 $string $countString} $times] ([splitline2 $stri$ } puts "regline [time {regline $string $countString} $times] ([regline $string $countStrin$ puts "regsubline [time {regsubline $string $countString} $times] ([regsubline $string $coun$ puts "strmapS [time {strmapS $string $countString} $times] ([strmapS $string $countStrin$ puts "countCharA [time {countCharsA $countString $string} $times] ([countCharsA $countStrin$ puts "strfirst [time {strfirst $string $countString} $times] ([strfirst $string $countStr$ } set times 1000 puts "Tcl patchlevel: [info patch]" for {set rep 100} {$rep <= 100} {incr rep 10} { set countString "\n" set string [string repeat "line1\nline2\nline3\nline4\n" $rep] puts "[string length $countString] char search, [string length $string] chars string" go $string $countString $times puts ---- set countString "line" set string [string repeat "line1\nline2\nline3\nline4\n" $rep] puts "[string length $countString] chars search, [string length $string] chars string" go $string $countString $times puts ---- set countString "\n" set string [string repeat [string repeat "line1\nline2\nline3\nline4\n" $rep] $rep] puts "[string length $countString] char search, [string length $string] chars string" go $string $countString $times puts ---- set countString [string repeat hovercraft 10] set string [string repeat "my $countString is full of eels" $rep] puts "[string length $countString] chars search, [string length $string] chars string" go $string $countString $times } # EOF ---- |[Category Performance]|[Category String Processing]|