Version 17 of Counting characters in a string

Updated 2005-12-17 00:17:11

Recently a discussion arose concerning a way of determining how many occurances of a character appeared in a string. Several methods were proposed:

Isn't this page a good reason why 'string count' should be added to Tcl?

 #! /usr/tcl84/bin/tclsh

 set mystring "line1\nline2\nline3\nline4\n"

 proc splitline {string} {
        set rc [llength [split $string "\n"]] 
        incr rc -1
        return $rc
 }

 proc regline {string} {
        set rc [regexp -line -all "\n" $string]
        return $rc
 }

 proc splitchar {string} {
 foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}}
        set rc $counters(\n)
        return $rc
 } 

 puts [time {splitline $mystring} 10000]
 puts [time {regline $mystring} 10000]
 puts [time {splitchar $mystring} 10000]

On my SPARC Solaris, using Tcl 8.4, I get

 49 microseconds per iteration
 154 microseconds per iteration
 963 microseconds per iteration

as the result. - RS: Surprising, but good to know, that split is by far the fastest way...

Brett Schwarz added this proc as well:

 proc strmap {str} {
 return [expr {[string length $str]-[string length [string map {"\n" ""} $str]]}]
 };

on my SuSE Linux 7.3 machine (2X333MHz PII) with Tcl 8.4, I get these results:

 34 microseconds per iteration
 87 microseconds per iteration
 782 microseconds per iteration
 34 microseconds per iteration

So, it gives the same performance of the split version. However, if the string length is increased, to, let's say, 100 lineX's (i.e. 100 \n), then this proc using string map actually scales better. Here are the numbers with 100:

 263 microseconds per iteration
 1729 microseconds per iteration
 2573 microseconds per iteration
 141 microseconds per iteration

There is quite an improvement....

However, there seems to be a difference in interpreters used. Here is the same timing with 8.3.4.4:

 288 microseconds per iteration
 1775 microseconds per iteration
 8514 microseconds per iteration
 578 microseconds per iteration

Appears string map got byte compiled in 8.4 (according to Miguel Sofer)

RS contributes maybe not the fastest, but for now the shortest variation:

 interp alias {} countCharsA {} regexp -all

to be called with the wanted character and the string as additional arguments

Brett Schwarz Some people (please correct me, if I got the wrong person) added some other procs, and here they are with some new times:

# Michael A. Cleverly

 proc splitline2 {string} {
        return [expr {[llength [split $string {\n}]] - 1}]
 }

Shouldn't that be

     ... [split $string "\n"] ...

? As it stands it will split on backslashes and n's, but not on newlines. And the return is of course unnecessary. FW: Yeah, but return is slightly faster. Lars H: Is it? That's interesting! But why is it faster? FW: (belated response) I think it's because when a call is nested into a return now, the interpreter knows that this is the last command executed in the procedure and so it can clean up the stack earlier, or something like that. Pre-8.1 return was actually slower, because it wasn't a special compiled construct.

# Michael A. Cleverly

 proc regsubline {string} {
        return [regsub -all -- {\n} $string {\n} string]
 }

# Miguel Sofer

 proc splitchar2 string {
    set rc 0
    foreach x [split $string ""] {
       if {$x == "\n"} {
          incr rc
       }
     }
     return $rc
 }

# Michael A. Cleverly

 proc strfirst {string} {
        set rc [set ndx 0]
        while {[set ndx [expr { [string first \n $string $ndx] + 1}]]} {
            incr rc
        }
        return $rc
 }


 puts "splitline [time {splitline $mystring} 10000]"
 puts "splitline2 [time {splitline2 $mystring} 10000]"
 puts "regline [time {regline $mystring} 10000]"
 puts "regsubline [time {regsubline $mystring} 10000]"
 puts "splitchar [time {splitchar $mystring} 10000]"
 puts "splitchar2 [time {splitchar2 $mystring} 10000]"
 puts "strmap [time {strmap $mystring} 10000]"
 puts "countCharA [time {countCharsA "\n" $mystring} 10000]"
 puts "strfirst [time {strfirst $mystring} 10000]"

Tcl 8.4.0.1


 splitline   265 microseconds per iteration
 splitline2  307 microseconds per iteration
 regline    1798 microseconds per iteration
 regsubline 1996 microseconds per iteration
 splitchar  2672 microseconds per iteration
 splitchar2 2872 microseconds per iteration
 strmap      141 microseconds per iteration
 countCharA 1781 microseconds per iteration
 strfirst    602 microseconds per iteration

On Win2k

 splitline   160 microseconds per iteration
 splitline2  192 microseconds per iteration
 regline    1310 microseconds per iteration
 regsubline 1410 microseconds per iteration
 splitchar  1461 microseconds per iteration
 splitchar2 1692 microseconds per iteration
 strmap       55 microseconds per iteration
 countCharA 1305 microseconds per iteration
 strfirst    295 microseconds per iteration

Tcl 8.3.4.4


 splitline   284 microseconds per iteration
 splitline2  324 microseconds per iteration
 regline    1814 microseconds per iteration
 regsubline 2082 microseconds per iteration
 splitchar  8723 microseconds per iteration
 splitchar2 3204 microseconds per iteration
 strmap      597 microseconds per iteration
 countCharA 1810 microseconds per iteration
 strfirst   4185 microseconds per iteration

On Win2k

 splitline   163 microseconds per iteration
 splitline2  190 microseconds per iteration
 regline    1324 microseconds per iteration
 regsubline 1434 microseconds per iteration
 splitchar  3786 microseconds per iteration
 splitchar2 1969 microseconds per iteration
 strmap      242 microseconds per iteration
 countCharA 1342 microseconds per iteration
 strfirst   3027 microseconds per iteration

As an aside, this illustrates the speed improvements in Tcl 8.4

I went ahead and added 200 more \n to the string (string repeat line1\n\n\n)...on a suggestion from David Welton. Some of the procs do not scale well at all. Here are the results on Linux:

 /usr/local/ActiveTcl8.4.0.1/bin/tclsh
 splitline    403 microseconds per iteration
 splitline2   349 microseconds per iteration
 regline     4894 microseconds per iteration
 regsubline  5254 microseconds per iteration
 splitchar   3280 microseconds per iteration
 splitchar2  3701 microseconds per iteration
 strmap       168 microseconds per iteration
 countCharA  4919 microseconds per iteration
 strfirst    1534 microseconds per iteration


 /usr/local/ActiveTcl8.3.4.4/bin/tclsh
 splitline    430 microseconds per iteration
 splitline2   354 microseconds per iteration
 regline     5039 microseconds per iteration
 regsubline  5495 microseconds per iteration
 splitchar  11297 microseconds per iteration
 splitchar2  4263 microseconds per iteration
 strmap       688 microseconds per iteration
 countCharA  4869 microseconds per iteration
 strfirst   14578 microseconds per iteration

Category Performance - Category String Processing