Recently a discussion arose concerning a way of determining how many occurrences of a character appeared in a string. Several methods were proposed: * [regexp] * [split] * [foreach]/[split]/[incr] * [string map] Isn't this page a good reason why 'string count' should be added to Tcl? #! /usr/tcl84/bin/tclsh set mystring "line1\nline2\nline3\nline4\n" proc splitline {string} { set rc [llength [split $string "\n"]] incr rc -1 return $rc } proc regline {string} { set rc [regexp -line -all "\n" $string] return $rc } proc splitchar {string} { foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}} set rc $counters(\n) return $rc } puts [time {splitline $mystring} 10000] puts [time {regline $mystring} 10000] puts [time {splitchar $mystring} 10000] On my SPARC Solaris, using Tcl 8.4, I get 49 microseconds per iteration 154 microseconds per iteration 963 microseconds per iteration as the result. - [RS]: Surprising, but good to know, that [split] is by far the fastest way... [Brett Schwarz] added this proc as well: proc strmap {str} { return [expr {[string length $str]-[string length [string map {"\n" ""} $str]]}] }; on my SuSE Linux 7.3 machine (2X333MHz PII) with Tcl 8.4, I get these results: 34 microseconds per iteration 87 microseconds per iteration 782 microseconds per iteration 34 microseconds per iteration So, it gives the same performance of the split version. However, if the string length is increased, to, let's say, 100 lineX's (i.e. 100 \n), then this proc using [string map] actually scales better. Here are the numbers with 100: 263 microseconds per iteration 1729 microseconds per iteration 2573 microseconds per iteration 141 microseconds per iteration There is quite an improvement.... However, there seems to be a difference in interpreters used. Here is the same timing with 8.3.4.4: 288 microseconds per iteration 1775 microseconds per iteration 8514 microseconds per iteration 578 microseconds per iteration Appears string map got byte compiled in 8.4 (according to [Miguel Sofer]) [RS] contributes maybe not the fastest, but for now the shortest variation: interp alias {} countCharsA {} regexp -all to be called with the wanted character and the string as additional arguments [Brett Schwarz] Some people (please correct me, if I got the wrong person) added some other procs, and here they are with some new times: # [Michael A. Cleverly] proc splitline2 {string} { return [expr {[llength [split $string {\n}]] - 1}] } ''Shouldn't that be'' ... [split $string "\n"] ... ''? As it stands it will split on backslashes and n's, but not on newlines. And the [return] is of course unnecessary.'' [FW]: Yeah, but return is slightly faster. ''[Lars H]: Is it? That's interesting! But '''why''' is it faster?'' [FW]: (belated response) I think it's because when a call is nested into a return now, the interpreter knows that this is the last command executed in the procedure and so it can clean up the stack earlier, or something like that. Pre-8.1 return was actually slower, because it wasn't a special compiled construct. # [Michael A. Cleverly] proc regsubline {string} { return [regsub -all -- {\n} $string {\n} string] } # [Miguel Sofer] proc splitchar2 string { set rc 0 foreach x [split $string ""] { if {$x == "\n"} { incr rc } } return $rc } # [Michael A. Cleverly] proc strfirst {string} { set rc [set ndx 0] while {[set ndx [expr { [string first \n $string $ndx] + 1}]]} { incr rc } return $rc } puts "splitline [time {splitline $mystring} 10000]" puts "splitline2 [time {splitline2 $mystring} 10000]" puts "regline [time {regline $mystring} 10000]" puts "regsubline [time {regsubline $mystring} 10000]" puts "splitchar [time {splitchar $mystring} 10000]" puts "splitchar2 [time {splitchar2 $mystring} 10000]" puts "strmap [time {strmap $mystring} 10000]" puts "countCharA [time {countCharsA "\n" $mystring} 10000]" puts "strfirst [time {strfirst $mystring} 10000]" Tcl 8.4.0.1 ----------- splitline 265 microseconds per iteration splitline2 307 microseconds per iteration regline 1798 microseconds per iteration regsubline 1996 microseconds per iteration splitchar 2672 microseconds per iteration splitchar2 2872 microseconds per iteration strmap 141 microseconds per iteration countCharA 1781 microseconds per iteration strfirst 602 microseconds per iteration On Win2k splitline 160 microseconds per iteration splitline2 192 microseconds per iteration regline 1310 microseconds per iteration regsubline 1410 microseconds per iteration splitchar 1461 microseconds per iteration splitchar2 1692 microseconds per iteration strmap 55 microseconds per iteration countCharA 1305 microseconds per iteration strfirst 295 microseconds per iteration Tcl 8.3.4.4 ----------- splitline 284 microseconds per iteration splitline2 324 microseconds per iteration regline 1814 microseconds per iteration regsubline 2082 microseconds per iteration splitchar 8723 microseconds per iteration splitchar2 3204 microseconds per iteration strmap 597 microseconds per iteration countCharA 1810 microseconds per iteration strfirst 4185 microseconds per iteration On Win2k splitline 163 microseconds per iteration splitline2 190 microseconds per iteration regline 1324 microseconds per iteration regsubline 1434 microseconds per iteration splitchar 3786 microseconds per iteration splitchar2 1969 microseconds per iteration strmap 242 microseconds per iteration countCharA 1342 microseconds per iteration strfirst 3027 microseconds per iteration As an aside, this illustrates the speed improvements in Tcl 8.4 I went ahead and added 200 more \n to the string (string repeat line1\n\n\n)...on a suggestion from [David Welton]. Some of the procs do not scale well at all. Here are the results on Linux: /usr/local/ActiveTcl8.4.0.1/bin/tclsh splitline 403 microseconds per iteration splitline2 349 microseconds per iteration regline 4894 microseconds per iteration regsubline 5254 microseconds per iteration splitchar 3280 microseconds per iteration splitchar2 3701 microseconds per iteration strmap 168 microseconds per iteration countCharA 4919 microseconds per iteration strfirst 1534 microseconds per iteration /usr/local/ActiveTcl8.3.4.4/bin/tclsh splitline 430 microseconds per iteration splitline2 354 microseconds per iteration regline 5039 microseconds per iteration regsubline 5495 microseconds per iteration splitchar 11297 microseconds per iteration splitchar2 4263 microseconds per iteration strmap 688 microseconds per iteration countCharA 4869 microseconds per iteration strfirst 14578 microseconds per iteration ---- |[Category Performance]|[Category String Processing]|