Version 25 of Counting characters in a string

Updated 2007-10-14 01:42:38 by Stu

Recently a discussion arose concerning a way of determining how many occurrences of a character appeared in a string. Several methods were proposed:

Isn't this page a good reason why 'string count' should be added to Tcl? - RS 2007-1013: This wiki has some 20,000 pages, many with one or more useful algorithms. But I wouldn't want them all drawn into the Tcl core... (Tcllib maybe, although that is already too crowded in some parts). If you can implement a (sometimes) useful function in one line of code, just do it locally :^) - but let's keep Tcl simple!

 #! /usr/tcl84/bin/tclsh

 set mystring "line1\nline2\nline3\nline4\n"

 proc splitline {string} {
        set rc [llength [split $string "\n"]] 
        incr rc -1
        return $rc
 }

 proc regline {string} {
        set rc [regexp -line -all "\n" $string]
        return $rc
 }

 proc splitchar {string} {
 foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}}
        set rc $counters(\n)
        return $rc
 } 

 puts [time {splitline $mystring} 10000]
 puts [time {regline $mystring} 10000]
 puts [time {splitchar $mystring} 10000]

On my SPARC Solaris, using Tcl 8.4, I get

 49 microseconds per iteration
 154 microseconds per iteration
 963 microseconds per iteration

as the result. - RS: Surprising, but good to know, that split is by far the fastest way...

Brett Schwarz added this proc as well:

 proc strmap {str} {
 return [expr {[string length $str]-[string length [string map {"\n" ""} $str]]}]
 };

on my SuSE Linux 7.3 machine (2X333MHz PII) with Tcl 8.4, I get these results:

 34 microseconds per iteration
 87 microseconds per iteration
 782 microseconds per iteration
 34 microseconds per iteration

So, it gives the same performance of the split version. However, if the string length is increased, to, let's say, 100 lineX's (i.e. 100 \n), then this proc using string map actually scales better. Here are the numbers with 100:

 263 microseconds per iteration
 1729 microseconds per iteration
 2573 microseconds per iteration
 141 microseconds per iteration

There is quite an improvement....

However, there seems to be a difference in interpreters used. Here is the same timing with 8.3.4.4:

 288 microseconds per iteration
 1775 microseconds per iteration
 8514 microseconds per iteration
 578 microseconds per iteration

Appears string map got byte compiled in 8.4 (according to Miguel Sofer)

RS contributes maybe not the fastest, but for now the shortest variation:

 interp alias {} countCharsA {} regexp -all

to be called with the wanted character and the string as additional arguments

Brett Schwarz Some people (please correct me, if I got the wrong person) added some other procs, and here they are with some new times:

# Michael A. Cleverly

 proc splitline2 {string} {
        return [expr {[llength [split $string {\n}]] - 1}]
 }

Shouldn't that be

     ... [split $string "\n"] ...

? As it stands it will split on backslashes and n's, but not on newlines. And the return is of course unnecessary. FW: Yeah, but return is slightly faster. Lars H: Is it? That's interesting! But why is it faster? FW: (belated response) I think it's because when a call is nested into a return now, the interpreter knows that this is the last command executed in the procedure and so it can clean up the stack earlier, or something like that. Pre-8.1 return was actually slower, because it wasn't a special compiled construct.

# Michael A. Cleverly

 proc regsubline {string} {
        return [regsub -all -- {\n} $string {\n} string]
 }

# Miguel Sofer

 proc splitchar2 string {
    set rc 0
    foreach x [split $string ""] {
       if {$x == "\n"} {
          incr rc
       }
     }
     return $rc
 }

# Michael A. Cleverly

 proc strfirst {string} {
        set rc [set ndx 0]
        while {[set ndx [expr { [string first \n $string $ndx] + 1}]]} {
            incr rc
        }
        return $rc
 }


 puts "splitline [time {splitline $mystring} 10000]"
 puts "splitline2 [time {splitline2 $mystring} 10000]"
 puts "regline [time {regline $mystring} 10000]"
 puts "regsubline [time {regsubline $mystring} 10000]"
 puts "splitchar [time {splitchar $mystring} 10000]"
 puts "splitchar2 [time {splitchar2 $mystring} 10000]"
 puts "strmap [time {strmap $mystring} 10000]"
 puts "countCharA [time {countCharsA "\n" $mystring} 10000]"
 puts "strfirst [time {strfirst $mystring} 10000]"

Tcl 8.4.0.1


 splitline   265 microseconds per iteration
 splitline2  307 microseconds per iteration
 regline    1798 microseconds per iteration
 regsubline 1996 microseconds per iteration
 splitchar  2672 microseconds per iteration
 splitchar2 2872 microseconds per iteration
 strmap      141 microseconds per iteration
 countCharA 1781 microseconds per iteration
 strfirst    602 microseconds per iteration

On Win2k

 splitline   160 microseconds per iteration
 splitline2  192 microseconds per iteration
 regline    1310 microseconds per iteration
 regsubline 1410 microseconds per iteration
 splitchar  1461 microseconds per iteration
 splitchar2 1692 microseconds per iteration
 strmap       55 microseconds per iteration
 countCharA 1305 microseconds per iteration
 strfirst    295 microseconds per iteration

Tcl 8.3.4.4


 splitline   284 microseconds per iteration
 splitline2  324 microseconds per iteration
 regline    1814 microseconds per iteration
 regsubline 2082 microseconds per iteration
 splitchar  8723 microseconds per iteration
 splitchar2 3204 microseconds per iteration
 strmap      597 microseconds per iteration
 countCharA 1810 microseconds per iteration
 strfirst   4185 microseconds per iteration

On Win2k

 splitline   163 microseconds per iteration
 splitline2  190 microseconds per iteration
 regline    1324 microseconds per iteration
 regsubline 1434 microseconds per iteration
 splitchar  3786 microseconds per iteration
 splitchar2 1969 microseconds per iteration
 strmap      242 microseconds per iteration
 countCharA 1342 microseconds per iteration
 strfirst   3027 microseconds per iteration

As an aside, this illustrates the speed improvements in Tcl 8.4

I went ahead and added 200 more \n to the string (string repeat line1\n\n\n)...on a suggestion from David Welton. Some of the procs do not scale well at all. Here are the results on Linux:

 /usr/local/ActiveTcl8.4.0.1/bin/tclsh
 splitline    403 microseconds per iteration
 splitline2   349 microseconds per iteration
 regline     4894 microseconds per iteration
 regsubline  5254 microseconds per iteration
 splitchar   3280 microseconds per iteration
 splitchar2  3701 microseconds per iteration
 strmap       168 microseconds per iteration
 countCharA  4919 microseconds per iteration
 strfirst    1534 microseconds per iteration


 /usr/local/ActiveTcl8.3.4.4/bin/tclsh
 splitline    430 microseconds per iteration
 splitline2   354 microseconds per iteration
 regline     5039 microseconds per iteration
 regsubline  5495 microseconds per iteration
 splitchar  11297 microseconds per iteration
 splitchar2  4263 microseconds per iteration
 strmap       688 microseconds per iteration
 countCharA  4869 microseconds per iteration
 strfirst   14578 microseconds per iteration

Stu 2007-10-13 I've taken the above procs and tried to make them as similar as possible and then run some tests with various lengths of data. Some of the procs only count characters, others can count strings. Some procs have been adjusted for this. Note the difference in speed between splitchar2 and splitchar2a simply by using eq instead of ==. The overall winner is strmap. Also shown in the output is the result of each proc; they should be all the same. If I've gotten your name wrong or you feel I misrepresented you or your proc somehow please make corrections.

Example output from my machine:

 1 char search, 240000 chars string
 splitchar   238632 microseconds per iteration (40000)
 splitchar2  305889 microseconds per iteration (40000)
 splitchar2a 167124 microseconds per iteration (40000)
 strmapC     19843 microseconds per iteration (40000)
 splitline   45872 microseconds per iteration (40000)
 splitline2  42932 microseconds per iteration (40000)
 regline     192236 microseconds per iteration (40000)
 regsubline  32450 microseconds per iteration (40000)
 strmapS     19131 microseconds per iteration (40000)
 countCharA  185561 microseconds per iteration (40000)
 strfirst    61804 microseconds per iteration (40000)
 ----
 100 chars search, 11900 chars string
 regline     67589 microseconds per iteration (100)
 regsubline  544 microseconds per iteration (100)
 strmapS     138 microseconds per iteration (100)
 countCharA  64383 microseconds per iteration (100)
 strfirst    458 microseconds per iteration (100)

The procs:

 proc splitline {string countChar} {
         set rc [llength [split $string $countChar]]
         incr rc -1
         return $rc
 }

 proc regline {string countString} {
         set rc [regexp -line -all $countString $string]
         return $rc
 }

 proc splitchar {string countChar} {
         foreach x [split $string ""] {if {[catch {incr counters($x)}]} {set counters($x) 1}}
         set rc $counters($countChar)
         return $rc
 }

 # Brett Schwarz (split into char and string versions)
 proc strmapC {string countChar} {
         return [expr {[string length $string]-[string length [string map [list $countChar ""] $string]]}]
 }
 proc strmapS {string countString} {
         return [expr {([string length $string]-[string length [string map [list $countString ""] $string]]) / [string length countString]}]
 }

 # Richard Suchenwirth
 interp alias {} countCharsA {} regexp -all

 # Michael A. Cleverly
 proc regsubline {string countString} {
         return [regsub -all -- $countString $string $countString string]
 }

 # Miguel Sofer
 proc splitchar2 {string countChar} {
         set rc 0
         foreach x [split $string ""] {
                 if {$x == $countChar} {
                         incr rc
                 }
         }
         return $rc   
 }
 # Miguel Sofer (== changed to eq)
 proc splitchar2a {string countChar} {                                                                
         set rc 0
         foreach x [split $string ""] {
                 if {$x eq $countChar} {
                         incr rc
                 }
         }
         return $rc
 }

 # Michael A. Cleverly
 proc strfirst {string countString} {   
         set rc [set ndx 0]
         while {[set ndx [expr { [string first $countString $string $ndx] + 1}]]} {
                 incr rc
         }
         return $rc   
 }

 proc go {string countString times} {
         if {[string length $countString] < 2} {
                 puts "splitchar   [time {splitchar $string $countString} $times] ([splitchar $string $countString])"
                 puts "splitchar2  [time {splitchar2 $string $countString} $times] ([splitchar2 $string $countString])"
                 puts "splitchar2a [time {splitchar2a $string $countString} $times] ([splitchar2a $string $countString])"
                 puts "strmapC     [time {strmapC $string $countString} $times] ([strmapC $string $countString])"
                 puts "splitline   [time {splitline $string $countString} $times] ([splitline $string $countString])"
                 puts "splitline2  [time {splitline2 $string $countString} $times] ([splitline2 $string $countString])"
         }
         puts "regline     [time {regline $string $countString} $times] ([regline $string $countString])"
         puts "regsubline  [time {regsubline $string $countString} $times] ([regsubline $string $countString])"
         puts "strmapS     [time {strmapS $string $countString} $times] ([strmapS $string $countString])"
         puts "countCharA  [time {countCharsA $countString $string} $times] ([countCharsA $countString $string])"
         puts "strfirst    [time {strfirst $string $countString} $times] ([strfirst $string $countString])"
 }

The thing:

 set times 1000

 puts "Tcl patchlevel: [info patch]"

 for {set rep 100} {$rep <= 100} {incr rep 10} {
         set countString "\n"
         set string [string repeat "line1\nline2\nline3\nline4\n" $rep]
         puts "[string length $countString] char search, [string length $string] chars string"
         go $string $countString $times
         puts ----
         set countString "line"
         set string [string repeat "line1\nline2\nline3\nline4\n" $rep]
         puts "[string length $countString] chars search, [string length $string] chars string"
         go $string $countString $times
         puts ----
         set countString "\n"
         set string [string repeat [string repeat "line1\nline2\nline3\nline4\n" $rep] $rep]
         puts "[string length $countString] char search, [string length $string] chars string"        
         go $string $countString $times
         puts ----
         set countString [string repeat hovercraft 10]
         set string [string repeat "my $countString is full of eels" $rep]
         puts "[string length $countString] chars search, [string length $string] chars string"
         go $string $countString $times
 }

Category PerformanceCategory String Processing