A regexp extension

NJG January 15, 2005

Recently I posted the page A regexp twist. Having had no feedback, I have no idea how much attention it has received. It has, however, suddenly occurred to me that neither the title nor the style of the prose did much to advertise its essence. So here is a more direct try.

A regexp twist provides an extension to the functionality of the regexp command in the form of

regexp -inline ?other options? pattern string script

where script is executed each time the pattern matches. (Note that in current Tcl this form is illegal, so it is a compatible extension.) The result of the actual match is available to script in the global match variables mVar0, mVar1, ..., mVar9.

For Windows users the downloadable zip file contains the source and the extension dll.

Those on Linux may either replace the .dll-specific part of the source with whatever is needed to compile it into a .so module, or replace the function Tcl_RegexpObjCmd in file tclCmdMZ.c of their Tcl source distribution with the one in the provided source and recompile Tcl. Lacking time, I cannot help those who need a Linux binary.

Please note: I do not know the oldest Tcl version with which this extension works. It certainly works with Tcl 8.4.x.


NJG January 23, 2005

A speed-tuned version can now be downloaded from A regexp twist!


MG takes a quick shot at this in pure-Tcl on 8.4.9 ...

 proc regexpScriptPre8.5 {args} {

   if { [llength $args] < 3 } {
      error "wrong # args"
   }
   set rArgs [lrange $args 0 end-3]
   set cmd [lindex $args end]
   eval "foreach x \[regexp -inline $rArgs \[lindex \$args end-2\] \[lindex \$args end-1\]\] \{ \
   uplevel 1 \[list [list $cmd]\] \[list \$x\] \
   \}"
 }

Or, in 8.5 simplified with {*} (though untested as I don't have 8.5)

 proc regexpScript8.5 {args} {
   if { [llength $args] < 3 } {
      error "wrong # args"
   }
   set rArgs [lrange $args 0 end-3]
   set cmd [lindex $args end]
   foreach x [regexp -inline {*}$rArgs [lindex $args end-2] [lindex $args end-1]] {
      uplevel 1 [list $cmd] [list $x]
   }
 }

Lars H: A less Quoting hell backport of the 8.5 version to 8.4:

 proc regexpScript {args} {
   if { [llength $args] < 3 } {
      error "wrong # args"
   }
   set rArgs [lrange $args 0 end-3]
   set cmd [lindex $args end]
   foreach x [eval [list regexp -inline] $rArgs [lrange $args end-2 end-1]] {
      uplevel 1 [list $cmd] [list $x]
   }
 }

which can of course be optimised further still by concatenating the lranges. But note that these do not do the same as the thing at the top of the page; here the last argument is a command prefix, but the compiled command is supposed to take an arbitrary script that accesses the match in global variables.
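To get closer to the script form described at the top of the page, a pure-Tcl wrapper can set the global variables itself before evaluating the script. The following is only a sketch under stated assumptions: the name regexpEach is made up for illustration, it relies on regexp -about reporting the subexpression count as its first element, it counts subexpressions without regard to options such as -expanded, and it does not try to reproduce how the compiled extension treats unused mVar variables or what it returns (here, the number of matches). It also folds the two lranges into one.

```tcl
# Hypothetical pure-Tcl emulation (Tcl 8.4 compatible); not NJG's implementation.
proc regexpEach {args} {
    if {[llength $args] < 3} {
        error "wrong # args: should be \"regexpEach ?options? pattern string script\""
    }
    set script  [lindex $args end]
    set pattern [lindex $args end-2]
    # Options, pattern and string go to regexp -inline in a single lrange.
    set flat [eval [list regexp -inline] [lrange $args 0 end-1]]
    # The first element of regexp -about is the subexpression count;
    # each match occupies that many elements plus one for the whole match.
    set width [expr {[lindex [regexp -about -- $pattern] 0] + 1}]
    set count 0
    for {set i 0} {$i < [llength $flat]} {incr i $width} {
        # Only mVar0..mVar9 are set, mirroring the variable names above.
        for {set k 0} {$k < $width && $k < 10} {incr k} {
            set ::mVar$k [lindex $flat [expr {$i + $k}]]
        }
        uplevel 1 $script
        incr count
    }
    return $count
}

# Usage: collect name/value pairs from every match.
set pairs {}
regexpEach -all {(\w+)=(\d+)} {a=1 b=22} {
    lappend pairs $::mVar1 $::mVar2
}
# pairs now holds: a 1 b 22
```

Unlike the command-prefix versions, this groups the flat -inline list by match, so a pattern with subexpressions triggers the script once per match rather than once per list element.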


MG Just decided to do a quick test to see what, if anything, the speed difference was...

 (Desktop) 6 % time {regexpScriptPre8.5 -all . {This is a test string} bleh} 500
 635 microseconds per iteration
 (Desktop) 7 % time {regexpScriptPre8.5 -all . {This is a test string} bleh} 5000
 664 microseconds per iteration
 (Desktop) 8 % time {regexpScriptPre8.5 -all . {This is a test string} bleh} 50000
 730 microseconds per iteration
 (Desktop) 9 % load xregexp.dll
 Extended regexp handling is in place
 (Desktop) 10 % time {regexp -inline -all . {This is a  test string} bleh} 500
 897 microseconds per iteration
 (Desktop) 11 % time {regexp -inline -all . {This is a  test string} bleh} 5000
 903 microseconds per iteration
 (Desktop) 12 % time {regexp -inline -all . {This is a  test string} bleh} 50000
 1017 microseconds per iteration

As you can see, I did the tests using the pre-8.5 version (with eval), rather than the {*} version. All the tests were done on Tcl 8.4.9, and the Tcl-only version used the "normal" regexp; i.e., I only loaded NJG's package after I'd tested the plain-Tcl code. Surprisingly, the plain-Tcl version comes out slightly faster. The 'bleh' script used there was just:

 proc bleh {x} {
   set ::tmp $x; return
 }

NJG January 19, 2005

MG, I would not say that a 40-50% difference is slight, so I looked into the code again. I found that the major part of the difference must come from my code saving 10 match variables at each match while your regexp -inline creates a sublist of only as many elements as there are subexpression matches. In the actual test 9 of the saves are superfluous!
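The number of values handled per match on the regexp -inline side can be seen directly from the length of the list it returns. A small illustration, with arbitrary sample input:

```tcl
# One element per match when the pattern has no parenthesised subexpressions,
# as with the "." pattern used in the timing test.
set plain [regexp -inline . {This}]
puts [llength $plain]    ;# 1

# Whole match plus one element per subexpression otherwise.
set grouped [regexp -inline {(.)(.)} {This}]
puts [llength $grouped]  ;# 3
```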

It is easy to remedy this, so I will shortly post the corrected version.

I stated in the original posting (A regexp twist) that executing the script by Tcl_Eval at each match, as the code does now, is the least efficient approach. However, its effect is least pronounced when the script consists of only a single parameterless procedure call, as is the case in your test.

Anyway, I did this hack as it was easy and I found the result aesthetically pleasing. Perhaps I should create the solution which is efficient as well ...

Finally, your test does not take into account the time needed for fetching the match and submatch values from the list representation returned by regexp -inline! (That is, the time of at least one set <varname> [lindex $x <index>] or lindex $x <index> command.)
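For illustration, the extra fetching step looks roughly like this when the match comes back as a list. The pattern and string are made up, and no timings are claimed:

```tcl
set str {width=640 height=480}

# regexp -inline returns the whole match followed by each submatch,
# so every value the script needs costs one lindex.
set m     [regexp -inline {(\w+)=(\d+)} $str]
set name  [lindex $m 1]   ;# width
set value [lindex $m 2]   ;# 640

# With -all the list is flat; foreach with several loop variables
# unpacks one match per iteration without explicit lindex calls.
set seen {}
foreach {whole name value} [regexp -inline -all {(\w+)=(\d+)} $str] {
    lappend seen $name $value
}
# seen is now: width 640 height 480
```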

Thanks for the feedback!