Match a regular expression against a string  

----

http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/regexp.htm

----

'''SYNOPSIS'''

'''regexp''' ''?switches? exp string ?matchVar? ?subMatchVar subMatchVar ...?  ''

'''DESCRIPTION'''

Determines whether the regular expression ''exp'' matches part or all of ''string'' and returns 1 if it does, 0 if it doesn't. (Regular expression matching is described in the [re_syntax] reference page.) 

If additional arguments are specified after ''string'' then they are treated as the names of variables in which to return information about which part(s) of ''string'' matched ''exp''. ''MatchVar'' will be set to the range of ''string'' that matched all of ''exp''.  The first ''subMatchVar'' will contain the characters in ''string'' that matched the leftmost parenthesized subexpression within ''exp'', the next ''subMatchVar'' will contain the characters that matched the next parenthesized subexpression to the right in ''exp'', and so on. 
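As a small illustration of the variable arguments, the following sketch captures the full match and two parenthesized subexpressions. The sample string and the variable names ''whole'', ''low'' and ''high'' are made up for this example.

```tcl
# Hypothetical input; whole, low and high are illustrative variable names.
set text "range 10-20 inclusive"

# whole receives the full match, low and high the two parenthesized groups.
if {[regexp {(\d+)-(\d+)} $text whole low high]} {
    puts "whole=$whole low=$low high=$high"
}
```

Here ''whole'' should end up as 10-20, ''low'' as 10 and ''high'' as 20.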

If the initial arguments to '''regexp''' start with '''-''' then they are treated as switches. The following switches are currently supported: 

   '''-about''':   Instead of attempting to match the regular expression, returns a list containing information about the regular expression. The first element of the list is a subexpression count. The second element is a list of property names that describe various attributes of the regular expression. This switch is primarily intended for debugging purposes.

   '''-expanded''':   Enables use of the expanded regular expression syntax where whitespace and comments are ignored.  This is the same as specifying the (?x) embedded option (see METASYNTAX in [re_syntax]).

   '''-indices''':   Changes what is stored in the subMatchVars.  Instead of storing the matching characters from string, each variable will contain a list of two decimal strings giving the indices in string of the first and last characters in the matching range of characters. 

   '''-line''':   Enables newline-sensitive matching.  By default, newline is a completely ordinary character with no special meaning.  With this flag, `[^' bracket expressions and `.' never match newline, `^' matches an empty string after any newline in addition to its normal function, and `$' matches an empty string before any newline in addition to its normal function.  This flag is equivalent to specifying both -linestop and -lineanchor, or the (?n) embedded option (see METASYNTAX in [re_syntax]).

   '''-linestop''':   Changes the behavior of `[^' bracket expressions and `.' so that they stop at newlines.  This is the same as specifying the (?p) embedded option (see METASYNTAX in [re_syntax]).

   '''-lineanchor''':   Changes the behavior of `^' and `$' (the "anchors") so they match the beginning and end of a line respectively.  This is the same as specifying the (?w) embedded option (see METASYNTAX in [re_syntax]).

   '''-nocase''':   Causes upper-case characters in string to be treated as lower case during the matching process.

   '''-start''' ''index'':   Specifies a character index offset into the string to start matching the regular expression at. When using this switch, `^' will not match the beginning of the line, and \A will still match the start of the string at index. If '''-indices''' is specified, the indices will be indexed starting from the absolute beginning of the input string. index will be constrained to the bounds of the input string.

   '''--''':   Marks the end of switches. The argument following this one will be treated as ''exp'' even if it starts with a -. 
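A few of the switches above can be tried out with a short sketch like the following. The sample text and variable names are assumptions chosen for illustration; ''-all'' and ''-inline'' are used only to collect every match in one call.

```tcl
set text "Hello\nWorld"

# -nocase: letter case is ignored while matching.
set nocase [regexp -nocase {hello} $text]

# -line: ^ and $ also match at line boundaries, so each line is found.
set lines [regexp -all -inline -line {^\w+$} $text]

# -start with -indices: matching begins at index 3, but the reported
# indices still count from the beginning of the string.
regexp -start 3 -indices {l} $text where

# --: lets the expression itself begin with a dash.
set dash [regexp -- {-x} "a-x"]

puts [list $nocase $lines $where $dash]
```

With this input, ''lines'' should hold the two words Hello and World, and ''where'' should be "3 3".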

If there are more ''subMatchVar'''s than parenthesized subexpressions within ''exp'', or if a particular subexpression in ''exp'' doesn't match the string (e.g. because it was in a portion of the expression that wasn't matched), then the corresponding ''subMatchVar'' will be set to "-1 -1" if '''-indices''' has been specified or to an empty string otherwise.  (From: [TclHelp] 8.2.3)
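The handling of unmatched subexpressions and surplus variables can be seen in a sketch like this one. The expression and the variable names (''whole'', ''first'', ''second'', ''extra'') are invented for the example.

```tcl
# Only the second alternative can match "b", so the first group is unmatched,
# and "extra" has no subexpression to correspond to.
regexp {(a)|(b)} "b" whole first second extra
set plain [list $whole $first $second $extra]

# With -indices, unmatched groups and surplus variables are set to "-1 -1".
regexp -indices {(a)|(b)} "b" whole first second extra
set indexed [list $whole $first $second $extra]

puts $plain
puts $indexed
```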

----
More info about the return values from ''-about'', written by [DKF] in Feb, 2007:

"These flags currently only exist for testing purposes. Going through the definitive list, I see:

   REG_UBACKREF:   Indicates that the RE contains backreferences, which forces a more expensive evaluation engine. (Note that this implies that there must be capturing parens, but there is no flag to indicate that.)
   REG_ULOOKAHEAD:   Indicates that the RE contains lookahead constraints.
   REG_UBOUNDS:   Indicates that the RE contains bounded matches (i.e. counted ranges expressed in the form {m,n})
   REG_UBRACES:   Indicates that the RE contains braces that are not bounds.
   REG_UBSALNUM:   Indicates that there's a rich backslash-alphanumeric sequence. Only happens when switched to parsing non-advanced REs.
   REG_UPBOTCH:   Indicates an unbalanced close-parenthesis ("specification botch" according to a comment in the source!)
   REG_UBBS:   Indicates that there is a backslash inside a bracketed character set.
   REG_UNONPOSIX:   Indicates that the RE is not a POSIX RE.
   REG_UUNSPEC:   Indicates that the RE is asking for unspecified behaviour?
   REG_UUNPORT:   Indicates that the RE is unportable?
   REG_ULOCALE:   Indicates that the RE is (potentially) dependent on the locale.
   REG_UEMPTYMATCH:   Indicates that the empty string is matched by the RE.
   REG_UIMPOSSIBLE:   Indicates that the RE cannot possibly match anything. (Not all "impossible" REs are detected though.)
   REG_USHORTEST:   Indicates that the RE is non-greedy, and so uses a different matching engine.

If you're not an RE wonk or matcher, I'd assert that virtually all of these are totally uninteresting. :-) The backrefs, lookahead and bounds are probably most interesting from a "describing what's in there" POV."
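To look at these values for a particular expression, something like the following sketch can be used. The expression is an arbitrary example with two capturing groups and a backreference; the exact set of property names reported may differ between Tcl versions, so none is assumed here beyond what the list above describes.

```tcl
# -about takes just the expression; no string to match is needed.
set about [regexp -about {(a)(b)\1}]

set count [lindex $about 0]   ;# number of capturing subexpressions
set flags [lindex $about 1]   ;# list of property names

puts "subexpressions: $count"
puts "properties: $flags"
```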


----

'''METASYNTAX''', anyone?

[re_syntax] covers the regular expression syntax, right?

----

someone needs to write up greedy vs non-greedy re issues


[MG] OK, I'm sure someone can do this better than me, but since nothing's here at the moment I'll make a start...

By default, the regexp characters ''+'' and ''*'' match as much as possible (which is called greedy matching). By placing a ''?'' after them, you can make them match as little as possible (non-greedy). For example...

  regexp "a.+3" "abc123abc123" var
  set var

would show the match as ''abc123abc123'', because the + matches all the characters up until the last 3. If you used...

  regexp "a.+?3" "abc123abc123" var
  set var

you'd see the match as ''abc123'' because +? matches as little as possible. Greedy regexp matching is a particular problem in parsing HTML, etc, because...

  set str "<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>"
  regexp "<b>(.*)</b>" $str -> var
  set var

would show ''Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text'' - matching as much as possible, it takes everything between the first occurrence of <b> and the ''last'' occurrence of </b>. But, using a non-greedy regexp to match...

  set str "<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>"
  regexp "<b>(.*?)</b>" $str -> var
  set var

would show what you want; ''Some Bold Text''. Hope that explanation/rambling is some use, at least until someone with more idea what they're doing puts something up :)

[AvL] I'll now mention a common pitfall with non-greedy REs:
Let's go back to the first example, but with a modified string:
  regexp "a.+?3" "abc123ax3" var
  set var
Although second possible match ''ax3'' would be shorter, it will still find the
first match ''abc123'', because even with non-greedy quantifiers, the first 
match always wins.
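Building on the HTML example, here is a sketch that collects every bold section with a non-greedy match, plus an alternative that avoids non-greedy quantifiers by excluding "<" from the captured text. The variable names ''bold'' and ''firstBold'' are made up for the illustration.

```tcl
set str "<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>"

# Non-greedy: each <b>...</b> pair is matched separately.
set bold {}
foreach {full inner} [regexp -all -inline {<b>(.*?)</b>} $str] {
    lappend bold $inner
}
puts $bold

# Alternative without a non-greedy quantifier: forbid "<" inside the group.
regexp {<b>([^<]*)</b>} $str -> firstBold
puts $firstBold
```

The negated bracket expression only works when the captured text cannot itself contain a "<", so it is not a general replacement for ''.*?''.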

----

Could someone replace this line with some verbiage regarding the way one uses regular expressions for specific newline and carriage return handling (as opposed to the use of the $ metacharacter)?

[Janos Holanyi]: I would really need to build up an RE that would match one line and only one line - that is, excluding carriage-return-newlines (\r\n) from matching... What would such an RE look like?

----

[LV] how about something like this?
 set a "abc
dev"
 # a now has two lines in it
 regexp -line -- {(.*)} $a b c d
 1
 puts $b
 abc
 puts $c
 abc

Note that if you want to keep carriage returns or newlines by themselves, but not when they are together, you need something like:

  regexp --  {^([^\r]|\r(?!\n))*}  $a b c d

This allows plain carriage return or plain newline.

Thanks to [bbh] and [Donal Fellows] for this regular expression.
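If neither carriage returns nor newlines should be part of the match at all, a simpler sketch is to exclude both characters with a bracket expression. The sample data and variable names below are assumptions for the example.

```tcl
set data "first line\r\nsecond line\r\n"

# [^\r\n]* stops at the first carriage return or newline.
regexp {^[^\r\n]*} $data firstLine
puts $firstLine

# Collect every non-empty line, leaving the \r\n terminators out.
set lines [regexp -all -inline {[^\r\n]+} $data]
puts $lines
```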

----

From [the comp.lang.tcl newsgroup]:
I did some experimenting with other strings, like "just a HHHHEEEEAAAADDDDEEEERRRR". The regular expression {(.)\1\1\1} does the job I would have wanted, whereas {(.){4}} will return the last of each four characters - as posted as well.

That surprised me too -- being able to place backreferences within the regex is an extremely powerful technique.

 regsub -all {(.)\1{3}} $string {\1} result

for exactly 4 char repeats, and {(.)\1+}    for arbitrary repeats
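Put together as a runnable sketch (the second input string is made up for the illustration):

```tcl
set string "just a HHHHEEEEAAAADDDDEEEERRRR"

# Collapse each run of exactly four identical characters to one.
regsub -all {(.)\1{3}} $string {\1} result
puts $result

# Collapse runs of two or more identical characters.
regsub -all {(.)\1+} "aaabccdddd" {\1} squeezed
puts $squeezed
```

The first call should leave "just a HEADER" in ''result''; the second should leave "abcd" in ''squeezed''.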

----

[Laurent Riesterer] has written a Visual Regexp tool
[http://laurent.riesterer.free.fr/regexp]
to help understand regexp operation.

----

[NEM] - A question on the Tclers Chat brought up a common problem that I've had when dealing with regular expressions. The RE engine allows [[^AB]] to mean "not A or B", but what if you want to match anything but the string "AB"? The only way to do it is to put lots of negated classes one after the other, which is ugly. So, here is a way to wrap that up into something a bit more elegant:

 proc not {pattern} {
     set ret "(?:"     ;# Not capturing bracket
     foreach char [split $pattern {}] {
         append ret "\[^$char\]"
     }
     append ret ")"
     return $ret
 }

Then you can do:
 regexp -- "AB([not AB]*)AB(.*)" ABcdefghABijklmnopqrst -> first rest
 first = "cdefgh"
 rest = "ijklmnopqrst"
And it handles the characters of the delimiter appearing in a different order:
 regexp -- "AB([not AB]*)AB(.*)" ABcdefghBAijklABmnopqrst -> first rest
 first = "cdefghBAijkl"
 rest = "mnopqrst"

Note though, that the text between the delimiters is consumed in chunks as long as the negated expression, so this will only match when the length of that text is a multiple of that length:
 regexp -- "AB([not AB]*)AB(.*)" ABcABslkdjf -> first rest
 => 0
The proper solution to this problem is a lot more complex, unfortunately.

----

The above three regexps can be written using a lookahead constraint.

 foreach str {ABcdefghABijklmnopqrst ABcdefghBAijkABlmnopqrst ABcABslkdjf} {
   set e "regexp -- {AB(?!AB)(.*)AB(.*)} $str -> first rest"
   puts "$e\n=> [eval $e]\nfirst = $first\nrest = $rest\n"
 }

Output:
 regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghABijklmnopqrst -> first rest
 => 1
 first = cdefgh
 rest = ijklmnopqrst

 regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghBAijkABlmnopqrst -> first rest
 => 1
 first = cdefghBAijk
 rest = lmnopqrst

 regexp -- {AB(?!AB)(.*)AB(.*)} ABcABslkdjf -> first rest
 => 1
 first = c
 rest = slkdjf

----

Important to note on the "[not pattern]" example above is that it will NOT match strings where there is an occurrence of the first letter from {pattern} when not part of the entirety of {pattern}:

 % regexp -- "AB([not AB]*)AB(.*)" ABcdefghAijklmABnopqrst -> first rest
 0
 % regexp -- "AB([not AB]*)AB(.*)" ABcdefghBijklmABnopqrst -> first rest
 1
 % set first
 cdefghBijklm
 % set rest
 nopqrst
----
[DKF]: It's actually fairly easy to request that an RE shouldn't match something. You just need some magic around it like this:
 regexp {^(?:(?!AB).)*$} $string
That matches any string that doesn't contain "AB" as a substring.
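A quick sketch of that expression against a few made-up strings, followed by the same lookahead idea applied to the "text between two AB delimiters" problem discussed above. The second expression is an adaptation written for this example, not something taken from the discussion.

```tcl
set re {^(?:(?!AB).)*$}
set results {}
foreach s {ABcdef cdefBA cAdBe {}} {
    # 1 when the string contains no "AB", 0 otherwise.
    lappend results [regexp $re $s]
}
puts $results

# Adaptation: capture the text between two AB delimiters, whatever its
# length, as long as it does not itself contain "AB".
regexp {AB((?:(?!AB).)*)AB(.*)} ABcdefghAijklmABnopqrst -> first rest
puts [list $first $rest]
```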
----

[elfring] 2003-10-29 Tcl values can be marked to show that they contain a compiled regular expression.
REs can be pre-compiled by calling "regexp $RE {}" [http://sourceforge.net/tracker/?group_id=10894&atid=360894&func=detail&aid=832230].

----

I would love to see some clarification on exactly how non-reporting subpatterns work with -inline,
specifically whether you can silence the overall pattern match:

 % set str { 
 
 asd;flkj <img src="example.jpg" > 
 sad;lfjl;kjf<IMg src="browser/ie.gif"> 
 asdflaj;lkfjasdf 
 lsdk 
  
 } 
 
 % set _Img {<img src="?([\w\./]*)"?[^>]*>} 
 <img src="?([\w\./]*)"?[^>]*> 
 % regexp -all -nocase -inline $_Img $str 
 {<img src="example.jpg" >} example.jpg {<IMg src="browser/ie.gif">} browser/ie.gif 

[glennj]: You can't silence the full match.  You will have to iterate over the results of regexp thusly:
  set matches [list]
  foreach {full submatch} [regexp -all -nocase -inline $_Img $str] {
      lappend matches $submatch
  }

----

[elfring] 2004-07-05 Does anybody know of problems with, and solutions for, matching optional parts with regular expressions [http://groups.google.de/groups?group=comp.lang.tcl&selm=40ed1d8f.0407010130.3e899f5d%40posting.google.com]?


[MG] July 17th 2004 - The problem with the regexp there seems to be that one of the parts to match optional white space is in the wrong place, and is matching too much. If you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change is that, after </S_URI> and before <P_URI>, the ''.*?'' has been moved inside the (?: ... ) group.)

 set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}
 
 set string {<name>gruss</name>
 <scope>SYSTEM</scope>
 <S_URI>http://XXX/Hallo.dtd</S_URI>
 <P_URI>http://YYY/Leute.dtd</P_URI>
 <definition><!ELEMENT gruss (#PCDATA)></definition>
 <attributes>Versuch="1"</attributes>
 <content><h1>Guten Tag!</h1></content>}
 
 regexp $pattern $string z name scope system public definition attributes content

----

Regular Expression for parsing http string:
 
 regexp {([^:]+)://([^:/]+)(:([0-9]+))} [ns_conn location] match protocol server x port

the above author should remember this is a TCL wiki, and not an [aolserver] one, but thanks for the submission ;)
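For use outside AOLserver, a sketch along the same lines in plain Tcl might look like this. The URL is a made-up stand-in for the value of ''ns_conn location'', and the trailing ''?'' that makes the port optional is an addition for this example.

```tcl
# Hypothetical URL standing in for the value of [ns_conn location].
set location "http://example.com:8080"

# The trailing ? (an addition for this sketch) makes the port optional.
set re {([^:]+)://([^:/]+)(:([0-9]+))?}

if {[regexp $re $location match protocol server x port]} {
    puts "protocol=$protocol server=$server port=$port"
}

# Without a port, the port variables are set to empty strings.
regexp $re "http://example.com/index.html" match2 protocol2 server2 x2 port2
puts "protocol=$protocol2 server=$server2 port=<$port2>"
```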

----
'''Regular Expressions Caching'''

Tcl dynamically caches compiled regular expressions. The Tcl core caches the last 30 REs it compiled, but you can cause any number of REs to be cached by assigning them to variables. If a regular expression is assigned to a variable and the variable is not changed, the Tcl core will save the compiled version of the RE and use that precompiled version the next time the variable is evaluated. In the core, the compiled version of the RE is stored in the Tcl_Obj, along with its string representation.

To find #pragma <something> statements define a pattern like

        set re {^\s*#\s*pragma\s+(.*)}
        if { [regexp $re $line -> rest] } {
            ...
        }

The above example will cause the compiled regular expression to be stored in the '''re''' variable.

(From c.l.t [http://groups.google.com/groups?threadm=1l6lo01kqjm5vd0rl8b027p83rto660v7i%404ax.com])

The run time benefit of regular expression caching can easily be shown:

 # Run N different regexp patterns 
 proc test_regexps N {
   for {set i 0} {$i < $N} {incr i} {
     regexp "foobar$i" "foobar1"
   }
 }
 puts "29 Took: [time { test_regexps 29 } 100]"
 puts "30 Took: [time { test_regexps 30 } 100]"
 puts "31 Took: [time { test_regexps 31 } 100]"
 puts "32 Took: [time { test_regexps 32 } 100]"

One run of this gave:

 29 Took: 298 microseconds per iteration
 30 Took: 372 microseconds per iteration
 31 Took: 2000 microseconds per iteration
 32 Took: 2107 microseconds per iteration

...clearly showing the extra cost of having to recompile each regexp pattern each time through once the number of patterns exceeds NUM_REGEXPS (30).

----
See also:

   * [regsub]
   * [string match]

----
'''Using Regular Expressions to Strip Visually Blank Lines'''

''[DKF] writes that'' it is hard to do this with any single RE on its own, though you can do it quite easily using a couple of things coupled together. This example uses [regsub] to strip the problematic lines, but cannot completely get rid of leading and trailing newlines without the extra [string trim]:

   string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n

However, I prefer selecting things positively, leading to a solution using [regexp] and [join]:

   join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n

[DKF] 10-Aug-2006: More experimentation indicates that a single [regsub] can do the whole job:

   regsub -all {^\n+|\n+$|(\n)+} $data {\1}

Note that the order of the alternatives is important!
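As a quick check of the first two approaches, the following sketch runs both on a small made-up input containing empty lines and a whitespace-only line. The variable names are chosen for the example.

```tcl
# Sample data: leading/trailing blank lines and one whitespace-only line.
set data "\n\nfoo\n   \n\nbar\n\n"

# regsub plus string trim
set viaRegsub [string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n]

# regexp plus join
set viaRegexp [join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n]

puts [list $viaRegsub $viaRegexp]
```

For this input, both should produce the two lines foo and bar separated by a single newline.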

----
'''A Regular Expression to Match Many Things in Any Order'''

[DKF]: Sometimes it is useful to be able to write a regular expression that matches a string that contains some number of substrings (typically words) in any order. In normal regexps, this is a horrible thing to write down as the size of the RE term varies exponentially with the number of substrings. However, if you don't mind matching behaviour that is guaranteed to be non-optimal in some strict sense, and if you don't want ''any'' capturing parens, you can use positive lookahead assertions to make things neater.

Thus, to match a string that contains '''foo''', '''bar''' and '''spong''' within it in any order, use a RE like this:

  set RE {(?=.*foo)(?=.*bar)(?=.*spong).}
  set matched [regexp $RE $string]

Just note that if you use this, you ''cannot'' know where those strings matched; lookahead assertions don't support that. If you need that data, use multiple [regexp] matches instead.
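A short sketch of the lookahead expression against a few made-up strings, with a separate ''-indices'' match to find where one of the words occurs:

```tcl
set RE {(?=.*foo)(?=.*bar)(?=.*spong).}

set results {}
foreach s {"spong bar foo" "foo and bar only" "xfooxspongxbarx"} {
    # 1 only when all three words are present, in any order.
    lappend results [regexp $RE $s]
}
puts $results

# Locating a word needs its own match.
regexp -indices {bar} "spong bar foo" where
puts $where
```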

----

[BAS] : just a tidbit, the [Postgresql] DBMS uses Tcl's regexp engine for its own regexp handling; see [http://archives.postgresql.org/pgsql-announce/2003-02/msg00008.php].
----
[[[Tcl syntax help]|[Arts and Crafts of Tcl-Tk Programming]|[Category Command] from [Tcl]|[Regular Expressions]|[Regular Expression Examples]|[Regular Expression Debugging Tips]|[Category String Processing]]]