Tcl regular expressions or Advanced Regular Expressions, implemented by Henry Spencer, are described below.
"Regular Expressions" is the name of an at-least-monthly column on scripting languages CL has co-authored from 1998 to 2009. See Cameron Laird's Personal Notes on Regular Expressions .
Jeffrey Friedl has written that "Tcl's [RE] engine is a hybrid with the best of both worlds."
See Henry Spencer's reply in tcl 8.2 regexp not doing non-greedy matching correctly , comp.lang.tcl, 1999-09-20.
Most common regular expression implementations (notably Perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See Regular Expression Matching Can be Simple and Fast (but is slow in Java, Perl PHP, Python, Ruby,...) , Russ Cox, 2007-01, for some fascinating benchmarks.
Lars H: A very nice paper! Highly recommended for anyone interested in the internals of regular expression engines, and a good introduction to the theory.
The regular expressions accepted by regexp and regsub are called "Advanced Regular Expressions'', or ARE's, and are slightly different from the extended regular expressions's defined by POSIX.
An expression is composed of branches, and a branch is composed of atoms. Each type of atom is described below.
When there is more than one branch in an expression, the branch the results in the longest match is selected.
An atom is an indivisible component in an expression, which in turn is composed of zero or more atoms. The most simple pattern is a character that matches the same character within the text. Other atoms have special interpretations. For example, . matches a single character at the current position, whatever that character is, and \n matches one newline character. (hello) is an ateom that encapsulates an entire expression which matches he word, "hello", and also captures the matching text for later reuse.
The available atoms are:
A quantifier specifies how many times the preceding atom should match. A greedy quantifier matches as many times as possible and a thrifty quantifier matches as few as possible. A quantifer without an explicit preference uses the preference of its branch. If the branch has no preference, a quantifier sets the preference of the branch. Once the greediness for a branch is set, it does not change.
Quantifiers that may make their branch greedy:
Quantifiers that may make their branch thrifty, and that are thrifty in any case:
Examples:
A greedy .+:
% regexp a.+3 abc123abc123 var 1 % set var abc123abc123
A non-greedy .+?:
% regexp a.+?3 abc123abc123 var 1 % puts $var abc123
In the following example .* matches as much as possible, which is everything from the first <b> to the last </b>, inclusive.
% set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>} % regexp <b>.*</b> $str var % set var <b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>
The nongreedy variant matches from first <b> to first subsequent </b>, inclusive:
% set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>} % regexp <b>.*?</b> $str var % set var <b>Some Bold Text</b>
Since the greediness of a branch is configured by the first thing in it that has any greediness, placing a "no-op" atom like (:?)*? at the beginning of a branch ensures that it is non-greedy. In the following example, .* configures the branch to be non-greedy:
regexp -nocase -- {(:?)*?<b>.*</b>} $str var
See Also
One trick with non-greedy quantifiers is to anchor the expression to the beginning/end of the string, which has the effect of stretching the non-greedy match:
%regexp -inline (.*?)(n+)(.*) ennui en e n {} %regexp -inline ^(.*?)(n+)(.*)$ ennui ennui e nn ui
A constraint specifies how an atom may match within a string.
The lookahead constraints may not contain back references (see later), and all parentheses within them are considered non-capturing.
An RE may not end with \.
AMG: I wish Tcl had Vim-like \zs and \ze to specify where the match officially starts and ends within a regular expression. To demonstrate their use, first here's how \ze works to implement positive lookahead:
hello(?= world) = hello\ze world
But what if we wanted to match "world" only when preceded by "hello"? How can that be done in Tcl regular expressions? In Vim, it's dead simple:
hello \zsworld
In the general case, \zs and \ze allow the regular expression to specify the leading and/or trailing contexts under which it will match. This is simpler and more flexible than matching the whole thing (context and all), then postprocessing the result to pick out only the desired portion. The reason is the leading and trailing contexts of one match may overlap any part of another match. This gets tricky because the leading contexts of two matches may overlap, yet two match bodies may not overlap.
Here's a more general example that matches every instance of "x" that is both preceded and followed by "x":
x\zsx\zex
Given the input "xxxxyxxx", it would match at zero-based indexes 1, 2, and 6:
xxxxyxxx .12...6.
A sequence having the form (?metachars), directs the operation of the regular expression system. For example, (?i) makes all further matching insensitive to character case.
BAS : just a tidbit, Postgresql uses Tcl's regexp engine for its own regexp handling; see PostgreSQL Weekly News - Feb 12th 2003 .
AvL: I'll now mention some common pitfall with non-greedy REs: Let's go back to the first example, but with a modified string:
% regexp a.+?3 abc123ax3 var % set var abc123
Although second possible match ax3 would be shorter, abc123 is selected because even with non-greedy quantifiers, the first match always wins.
PYK 2019-08-15: Yes, greediness of an atom, sub-expression or branch does not change the rule that the earliest match is preferred.
The greediness of a branch applies to all quantifiers in the branch. In the following example, one might expect $last to contain the rest of the string, but it doesn't:
% regexp {([^\s]*?)(\s+)(.*)} {a large heart and a large head} full first delim last]} {a } a { } {}
(^\s*?) matched the first a, (\s+) matched the first whitespace character, and (.*) matched nothing. Why? The branch is non-greedy because the first thing with greediness, (^\s*?, is non-greedy, and therefore everything else in the branch is non-greedy as well.
Henry Spencer writes
>...You can't put extra spaces into regular >expressions to improve readability, you just have to suffer along >with the rest of us.
Actually, since 8.1 you can, although since it's one of 57 new features, it's easy to miss. Like so:
set re {(?x) \s+ ([[:graph:]]+) # first number \s+ ([[:graph:]]+) # second number } set data " -1.2117632E+00 -5.6254282E-01" regexp $re $data match matchX matchY
The initial (?x), which must be right at the start, puts the regexp parser into expanded mode, which ignores white space (with some specific exceptions) and #-to-end-of-line comments.
regexp and regsub store a compiled representation of the expression as its internal representation. In subsequent calls to regsub or regexp, this internal representation is used if available.
See also Feature Request #312: Function to Compile a Pattern , 2003
It isn't really necessary to "precompile" a regular expression, but this would have a similar effect:
regexp -about $RE
AMG: The only purpose I see for this is to detect syntax errors early rather than wait until the regular expression is used.
Some folks insist on inserting commas (or other characters) to format digits into groups of three. According to Keith Vetter, the Perl manual describes the following method:
1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
This is a Tcl pre-8.1 translation:
set n 123456789.00 while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {} puts $n
results in
123,456,789.00
escape characters and quantifiers tighten this up a bit:
while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}
Using the extended syntax makes it a little easier to read:
while {[regsub {(?x) ^([-+]?\d+) # The number at the start of the string... (\d{3}) # ...has three digits at the end } $n {\1,\2} n]} { # So we insert a comma there and repeat... }
RS: For a version with configurable separator, see Bag of algorithms, item "Number commified"
Ro: See also Human readable file size formatting for a version without regular expressions for those of us who are allergic to monstrous complexity ;)
NEM: A question in the Tcl Chatroom brought up a common problem that I've had when dealing with regular expressions. [^AB] means "not A or B", but what if you want to match anything but the string AB? The only way to do it is to put lots of negated classes one after the other, which is ugly. So, here is a way to wrap that up into something a bit more elegant:
proc not pattern { # Not capturing bracket set ret {(?:} foreach char [split $pattern {}] { append ret \[^$char\] } append ret ) return $ret }
Then you can do:
regexp -- "AB([not AB]*)AB(.*)" ABcdefghABijklmnopqrst -> first rest
$first is cdefgh and $rest is ijklmnopqrst.
It handles things like:
regexp -- "AB([not AB]*)AB(.*)" ABcdefghBAijkABlmnopqrst -> first rest
$first is cdefghBAijk and $rest is lmnopqrst.
But that this only matches match patterns that are at least the same length as the negated expression:
regexp -- "AB([not AB]*)AB(.*)" ABcABslkdjf -> first rest ;#-> 0
The proper solution to this problem is a lot more complex, unfortunately.
The above three expressions can be written using a lookahead constraint.
foreach str {ABcdefghABijklmnopqrst ABcdefghBAijkABlmnopqrst ABcABslkdjf} { set e "regexp -- {AB(?!AB)(.*)AB(.*)} $str -> first rest" puts "$e\n=> [eval $e]\nfirst = $first\nrest = $rest\n" }
Output:
regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghABijklmnopqrst -> first rest #-> 1 #first -> cdefgh #rest -> ijklmnopqrst egexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghBAijkABlmnopqrst -> first rest #-> 1 #first -> cdefghBAijk #rest -> lmnopqrst regexp -- {AB(?!AB)(.*)AB(.*)} ABcABslkdjf -> first rest #-> 1 #first -> c #rest -> slkdjf
Important to note on the [not pattern] example above is that it does not match strings where there is an occurrence of the first letter from pattern when not part of the entirety of pattern:
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghAijklmABnopqrst -> first rest 0 % regexp -- "AB([not AB]*)AB(.*)" ABcdefghBijklmABnopqrst -> first rest 1 % set first cdefghBijklm % set rest nopqrst
DKF: It's actually fairly easy to request that an RE shouldn't match something. You just need some magic around it like this:
regexp {^(?:(?!AB).)*$} $string
That matches any string that doesn't contain AB as a subsequence.
I would love to see a some clarification on exactly how non-reporting subpatterns work with -inline, specifically if you can silence the overall pattern match:
% set str { asd;flkj <img src="example.jpg" > sad;lfjl;kjf<IMg src="browser/ie.gif"> asdflaj;lkfjasdf lsdk } % set _Img {<img src="?([\w\./]*)"?[^>]*>} <img src="?([\w\./]*)"?[^>]*> % regexp -all -nocase -inline $_Img $str {<img src="example.jpg" >} example.jpg {<IMg src="browser/ie.gif">} browser/ie.gif
glennj: You can't silence the full match. You will have to iterate over the results of regexp:
set matches {} foreach {full submatch} [regexp -all -nocase -inline $_Img $str] { lappend matches $submatch }
elfring 2004-07-05: Does anybody know problems and solutions to match optional parts with regular expressions?
See Matching of optional parts in regular expressions , comp.lang.tcl, 2004-07-01.
MG 2004-07-17: The problem with the regexp there seems to be that one of the parts to match optional white space is in the wrong place, and is matching too much. If you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change is that, after </S_URI> and before <P_URI>, .*? has been moved inside (?: ... )
set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$} set string {<name>gruss</name> <scope>SYSTEM</scope> <S_URI>http://XXX/Hallo.dtd</S_URI> <P_URI>http://YYY/Leute.dtd</P_URI> <definition><!ELEMENT gruss (#PCDATA)></definition> <attributes>Versuch="1"</attributes> <content><h1>Guten Tag!</h1></content>} regexp $pattern $str z name scope system public definition attributes content
DKF: Sometimes it is useful to be able to write a regular expression that matches a string that contains some number of substrings (typically words) in any order. In normal regexps, this is a horrible thing to write down as the size of the RE term varies exponentially with the number of substrings. However, if you don't mind matching behaviour that is guaranteed to be non-optimal in some strict sense, and if you don't want any capturing parens, you can use positive lookahead assertions to make things neater.
Thus, to match a string that contains foo, bar and spong within it in any order, use a RE like this:
set RE {(?=.*foo)(?=.*bar)(?=.*spong).} set matched [regexp $RE $string]
Just note that if you use this, you cannot know where those strings matched; lookahead assertions don't support that. If you need that data, use multiple regexp matches instead
MAH: What yould be the correct way to loop over all matches of a regular expression in a string? I came up with the following solution
for finding all include statements in a string, but using -start has side effects on the meaning of characters like $ and ^.
set pos 0 while {[regexp -start $pos {`include "([\w/.]+)"} $data string vincfile]==1} { set pos [expr {$pos+[string length $string]}] puts "file=$vincfile" }
Lars H: The option combination -all -inline is probably what you're looking for (although in general the problem of "finding all matches" runs into several technical issues, due to the fact that matches may overlap).
In combination with -start, one has to use \A and \Z instead of $ and ^, unless the intent is to use the newline-sensitive behaviours of the latter. -indices may also be useful.
MAH: Okay, -inline is too clumsy for me since I don't want the overall match string. Instead I'll go with -indices. This gives me
set pos 0 while {[regexp -start $pos -indices {`include "([\w/.]+)"} $data -> vincfilepos]==1} { set vincfile [ string range $data [ lindex $vincfilepos 0 ] [ lindex $vincfilepos 1 ] ] set pos [ lindex $vincfilepos 1 ] puts "file=$vincfile" }
I think that's rather clumsy for a task this common. Any ideas on how to make it simpler?
PYK 2014-09-12:
set data { `include "one.h" `include "two.h" } foreach {all indices} [ regexp -all -inline -indices {`include "([\w/.]+)"} $data] { lassign $indices first last set vincfile [string range $data $first $last] puts file=$vincfile }
HE 2016-01-28: With the given regular expression I would use:
set data { `include "one.h" `include "two.h" } foreach {-> vincfile} [regexp -all -inline -- {`include "([\w/.]+)"} $data] { puts "file=$vincfile" }
PYK 2016-09-11: Reportable matches are indexed according to the order in which they appear lexically in the regular expression, and all reporting matches in all branches are indexed. Therefore, there's a trick to extracting a piece of data where branches are used to express the conditions the data might be found under. Since only one branch is represented in the reported matches, the reported matches from other branches are guaranteed to be the empty string, so the indexed matches can simply be concatenated. For example:
set url https://wiki.tcl-lang.org/461 set pattern {^(https?:\/\/wiki.tcl.tk\/)(?:.*\?redir=([0-9]+)|_\/edit\?N=([0-9]+)|([0-9]+))$} regexp $pattern $url -> prefix b1 b2 b3 set result $b1$b2$b3 ;# -> 461
Tcl switched to Advanced Regular Expressions in version 8.1.
The above discussion needs to cover the advanced regular expression syntax as handled by Tcl 8.1 and later and show the user what all the differences are, so that one can write portable code when necessary - or at least create appropriate package require statements.
KBK has astutely remarked that, "Much of the art of designing recognizers is the art of controlling such things in the common cases; regexp matching in general is PSPACE-complete, and our extended regexps are even worse (... not bounded by any tower of exponentials ...)." [L1 ]
Lars H 2008-06-01: Somehow I doubt KBK would say that, in part because it's dead wrong as far as basic regular expressions are concerned — given a regular expression of size m and a string of size n it is always possible to test whether the string matches that regular expression in time that is linear in n and polynomial in m. Googling for "regexp matching PSPACE complete" turns up this page, but otherwise rather suggests that other problems concerning regular expressions, in particular deciding whether two regular expressions are equivalent, may be PSPACE-complete. (Which is actually kind of interesting, since the naive determinization algorithm for this might need exponential amounts of memory and thus not be in PSPACE at all, but off-topic.)
The link provided as source currently doesn't work (no surprise, it's into SourceForge mail archives), but the forum_id seems to refer to development of a Perl module (text::similarity, in some capitalization) rather than anything Tcl related. That matching using Perl's so-called "regexps" should be "worse than PSPACE-complete" is something I can believe, so in that context the quote makes sense, but why it should then be attributed to KBK, and moreover why it should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is still a mystery.
LV: Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core (well, actually from what I can see, one can only search ALL of activestate's mailing list archives), doesn't turn up a reference like this. Maybe it is quite old - before activestate?
TV: Auw, man... That´s like suggesting something like the traveling salesman problem is only there to upset people that a certain repository will have the perfect solution for this type of problem, but like that the actual worlds´ best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer tapes from some university in 1984 or so, the whole of "datastructures and algorithms" will end up like the ´English IT Show´ on the Comedy Channel, and than on the Who says Tcl sucks... graveyard like the connection machine was great but forgotten and the world´s greatest synthesizer developers/researchers are in "The Dead Presidents Society" (CEO´s that is, like ´the dead Poets Society´).