regexp - Match a regular expression against a string
Determines whether the regular expression expr matches part or all of string. Returns 1 if it does, 0 if it doesn't.
Any additional arguments specified after string are the names of variables in which to return information about which parts of string matched expr. MatchVar will be set to the range of string that matched all of expr. The first subMatchVar will contain the characters in string that matched the first parenthesized subexpression within expr. The next subMatchVar will contain the characters that matched the next parenthesized subexpression to the right in expr, and so on.
If the initial arguments to [regexp] start with - then they are treated as switches. The following switches are currently supported:
If there are more subMatchVars than parenthesized subexpressions within expr, or if a particular subexpression in expr doesn't match the string (e.g. because it was in a portion of the expression that wasn't matched), then the corresponding subMatchVar will be set to "-1 -1" if -indices has been specified or to an empty string otherwise. (From: TclHelp 8.2.3)
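A minimal sketch of how the match variables get filled in, with and without -indices. The sample string and the variable names (whole, key, value and so on) are made up for illustration:

```tcl
# Hypothetical sample input; the pattern has two parenthesized subexpressions.
set line "width=120"

# whole receives the full match, key and value the two submatches.
if {[regexp {(\w+)=(\d+)} $line whole key value]} {
    puts "whole=$whole key=$key value=$value"
}

# With -indices, each variable receives a {first last} index pair instead.
regexp -indices {(\w+)=(\d+)} $line wholeIdx keyIdx valueIdx
puts "wholeIdx=$wholeIdx keyIdx=$keyIdx valueIdx=$valueIdx"

# A subexpression in a branch that did not take part in the match
# yields an empty string, or "-1 -1" when -indices is given.
regexp {(a)|(b)} "b" m first second
puts "first='$first' second='$second'"
regexp -indices {(a)|(b)} "b" m firstIdx secondIdx
puts "firstIdx=$firstIdx"
```

The first puts should print whole=width=120 key=width value=120, and the index pairs for the same match are 0 8, 0 4 and 6 8.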
puts {enter string:}
set input [read stdin]
if {[regexp {abc} $input]} {
    puts yes
} else {
    puts no
}
More info about the return values from -about, written by DKF in Feb, 2007 (with further additions and clarifications by DKF from a bit later in italics):
" currently only exist for testing purposes. Going through the definitive list, I see:
If you're not an RE wonk or matcher, I'd assert that virtually all of these are totally uninteresting. :-) The backrefs, lookahead and bounds are probably most interesting from a "describing what's in there" POV."
I can't see any value in UNONPOSIX, UUNSPEC, UUNPORT or ULOCALE; they just don't seem to correspond to any question I might ever wish to ask about a regular expression. UBSALNUM and UPBOTCH are very low-value too, as they only apply when you move the RE engine into a non-standard mode.
Saravanan: Can anyone tell me how to retrieve the count of a particular character in a given string (using regexp only)? E.g.: set a "hithisisisis". I need to find how many occurrences of 'i' there are in $a.
Lars H: Use the -all option:
% regexp -all i $a
5
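If the positions are wanted as well as the count, combining -all with -inline and -indices should do it. A small sketch using the same sample string (the variable names count and positions are just for illustration):

```tcl
set a "hithisisisis"

# -all on its own returns the number of matches.
set count [regexp -all i $a]

# -all -inline -indices returns one {first last} pair per match.
set positions {}
foreach pair [regexp -all -inline -indices i $a] {
    lappend positions [lindex $pair 0]
}

puts "$count occurrences at indices: $positions"
```

For this string that prints 5 occurrences at indices: 1 4 6 8 10.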
Feb 9th 2007 CJL wondered on Ask#5 what the correct/best/proper way of writing a regexp with quotes and the current value of a variable in the expression was? I want to match various patterns of the form <INPUT TYPE="TEXT" NAME="$something" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">, where $something has a range of values that is a subset of all possible values, i.e. I don't want to put \S+ in place of $something as that will give unwanted matches. Note the presence of quotes and escapes to complicate things.
MG Using format is probably one of the simplest.
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $pattern $something]
Assuming, of course, you don't have %-'s in your string. Otherwise, building it in steps may be easiest:
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="}
append pattern $something
append pattern {" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
LV I suspect the OP will need to replace those \d with %d and the \S with %s.
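A quick check of the format approach, for what it's worth. format only interprets % sequences, and the braces stop the Tcl parser from touching the backslashes, so \d and \S should reach regexp unchanged. The sample line and the variable names template and sample are made up for illustration:

```tcl
# Hypothetical values, for illustration only.
set something "foobar"
set template {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $template $something]

# A made-up line of the kind the question describes.
set sample {<INPUT TYPE="TEXT" NAME="foobar" SIZE="20" MAXLENGTH="8" VALUE="abc">}

puts $pattern
puts [regexp -- $pattern $sample]
```

The pattern prints with \d+, \d and \S+ intact and the match returns 1. Note that this sketch assumes $something holds no RE metacharacters.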
Tcl caches compiled regular expressions dynamically. The Tcl core caches the last 30 REs it compiled, but you can cause any number of REs to be cached by assigning them to variables. If a regular expression is assigned to a variable and the variable is not changed, the Tcl core will save the compiled version of the RE and reuse it the next time the variable's value is used as an RE. In the core, the compiled version of the RE is stored in the Tcl_Obj, along with its string representation.
To find #pragma <something> statements define a pattern like
set re {^\s*#\s*pragma\s+(.)}
if { [regexp $re $line -> rest] } {
    ...
}
The above example will cause the compiled regular expression to be stored in the value held by re.
From "possible to precompile regexps", comp.lang.tcl, 2004-11-04.
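A sketch of the same idea in a loop: the pattern lives in one variable that is never modified, so the compiled form can be reused on every iteration. The sample lines, the variable names and the (\S+) capture (which grabs just the first word after pragma) are assumptions for illustration, not taken from the newsgroup thread:

```tcl
# Pattern kept in a variable that is never modified afterwards.
# (\S+) is a hypothetical variant capturing the first word after "pragma".
set re {^\s*#\s*pragma\s+(\S+)}

# Made-up input lines.
set lines [list "#pragma once" "  # pragma pack(1)" "int x;"]

set found {}
foreach line $lines {
    # The same value is passed each time, so it need not be recompiled.
    if {[regexp $re $line -> name]} {
        lappend found $name
    }
}
puts $found
```

This prints once pack(1) for the lines above.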
The run-time benefit of regular expression caching can easily be shown:
# Run N different regexp patterns
proc test_regexps N {
    for {set i 0} {$i < $N} {incr i} {
        regexp "foobar$i" "foobar1"
    }
}
puts "29 Took: [time { test_regexps 29 } 100]"
puts "30 Took: [time { test_regexps 30 } 100]"
puts "31 Took: [time { test_regexps 31 } 100]"
puts "32 Took: [time { test_regexps 32 } 100]"
One run of this gave:
29 Took: 298 microseconds per iteration
30 Took: 372 microseconds per iteration
31 Took: 2000 microseconds per iteration
32 Took: 2107 microseconds per iteration
...clearly showing the extra cost of having to recompile each regexp pattern each time through due to exceeding NUM_REGEXPS (30).
DKF writes that it is hard to do this (stripping blank lines from a multi-line string) with any single RE on its own, though you can do it quite easily by coupling a couple of things together. This example uses regsub to strip the problematic lines, but cannot completely get rid of leading and trailing newlines without the extra string trim:
string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n
However, I prefer selecting things positively, leading to a solution using regexp and join:
join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n
DKF 2006-08-10: More experimentation indicates that a single [regsub] can do the whole job:
regsub -all {^\n+|\n+$|(\n)+} $data {\1}
Note that the order of the alternatives is important!
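To see the first two approaches side by side, here is a small sketch run on a made-up sample that has leading, trailing and whitespace-only lines. The variable names data, viaRegsub and viaRegexp are just for illustration, and newlines are made visible with string map:

```tcl
# Hypothetical sample: two leading newlines, an empty line, a line
# holding only spaces, and two trailing newlines.
set data "\n\nalpha\n\n  \nbeta\n\n"

# Approach 1: collapse runs of blank lines, then trim the ends.
set viaRegsub [string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n]

# Approach 2: positively select the lines containing a non-space character.
set viaRegexp [join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n]

# Show newlines as a visible backslash-n.
puts [string map [list \n {\n}] $viaRegsub]
puts [string map [list \n {\n}] $viaRegexp]
```

Both lines of output should read alpha\nbeta for this sample. The single-regsub version matches only newline characters, so a line consisting solely of spaces, as in this sample, would presumably survive it.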