Tcl '''Regular Expressions''', implemented by [Henry Spencer], are called
[ARE%|%Advanced Regular Expressions].


** Disambiguation **

"Regular Expressions" is the name of an at-least-monthly column on scripting
languages that [CL] co-authored from 1998 to 2009.  See
[http://www.regularexpressions.com/%|%Cameron Laird's Personal Notes on Regular
Expressions].


** See Also **

   [re_syntax]:   

   [http://www.tcl.tk/doc/howto/regexp81.html%|%New Regular Expression Features in Tcl 8.1]:   

   [regular expression]:   about regular expressions in general, including features not found in Tcl, and the theory behind them.

   [Beginning Regular Expressions]:   

   [Regular Expression Examples]:   

   [Advanced Regular Expression Examples]:   

   [Regular Expression Debugging Tips]:   
   
   [https://groups.google.com/d/msg/comp.compilers/8oYjzrXhIEk/iZ3h0krmv0gJ%|%Henry Spencer's "Tcl" Regex Library] ,comp.compilers, 2007-10-01:   

   [http://www.arglist.com/regex%|%regex - Henry Spencer's regular expression libraries]:   links to Henry Spencer's original release in 1994, "regex3.8a.tar.gz", that was included in 4.4 BSD Unix, Walter Waldo's port of Henry's Tcl regex in Tcl 8.1, and Thomas Lackener's port from Tcl-8.5a3

   [http://beedub.com/book/3rd/regexp.pdf%|%Chapter 11: Regular Expressions] ,[Book Practical Programming in Tcl and Tk] , 3rd Edition, by [Brent Welch]:   

   [http://web.archive.org/web/20080609172251/http://www.unixreview.com/documents/s=10121/ur0702e/%|%Tcl Scores High in RE Performance] ,[Cameron Laird] and Kathryn Soraiz ,2007-02

   [Drawbacks of Tcl's Regexps]:   

   [http://www.reddit.com/r/programming/comments/68280/tcl_regex_implementation_beats_whole_competition/%|%TCL regex implementation beats whole competition. Even C with PCRE and Boost regex, (shootout.alioth.debian.org) ,reddit.com ,2006-02-07] ,2008-02-07

   [http://perldoc.perl.org/perlreguts.html#Unicode-and-Localisation-Support%|%Unicode and Localisation Support]:   interesting in a Tcl context because it illustrates that [Perl] regular expressions, being derived from [Henry Spencer%|%Spencer's] earlier regular expression attempts, suffer [Unicode] issues that Tcl doesn't

   [http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html%|%Regular Expressions: Now You Have Two Problems] ,Jeff Atwood ,2008-06-27:   some pulpit-thumping about regular expressions


** Resources **

   [Komodo]:   includes a regular expression debugger

   [Visual REGEXP]: 

   [http://laurent.riesterer.free.fr/regexp/%|%Visual REGEXP]:   a regular expression toolkit

   [tkWorld]:   contains tkREM, which is a regular expression maker.  Perhaps someone familiar with it would like to discuss it.

   [http://txt2regex.sourceforge.net%|%^txt2regex$]: a Regular expression wizard written in bash2 that converts human sentences into regular expressions.  It can be used to build up regular expressions suitable for use in Tcl.

   [redet]:   another tool to assist in developing regular expressions

   [ftp://ftp.tcl.tk/pub/tcl/all/n/nre/3.0/nre30.tar.gz%|%nre30.tar.gz]:   one of a couple of extensions prior to Tcl 8.1 that provided a superset of regular expression functionality.  This does not provide all the power of Tcl 8.1 and newer, but at least it is more than was available before 8.1.

   [tcLex]:   a lexical analyzer which uses Tcl regular expressions to do the matching

   [Yeti]:   another lexical analyser, parser generator.


** Design **

Tcl's regular expression engine is an interesting and subtle object for study
in its own right.  While [Perl] is the language that deserves its close
identification with RE capabilities, Tcl's engine competes well with it and
every other one.  In fact, although he doesn't favor Tcl as a language, RE
expert [Jeffrey Friedl] has written
[http://www.oreillynet.com/pub/a/network/2002/07/15/regexp.html] that "Tcl's
[[RE]] engine is a hybrid with the best of both worlds."

For more on different engines, see Henry's comments in
[http://groups.google.com/groups?selm=FIECG4.F75%40spsystems.net].

Most common regular expression implementations (notably Perl and direct derivatives
of the [PCRE] library) exhibit poor performance in certain
pathological cases.  Henry Spencer's complete reimplementation as a "hybrid" engine
appears to address some of those problems.  See [http://swtch.com/~rsc/regexp/regexp1.html]
for some fascinating benchmarks.

[Lars H]: A very nice paper! Highly recommended for anyone interested in the
internals of regular expression engines, and a good introduction to the theory.
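
A rough way to try one such pathological case in Tcl is sketched below. The pattern shape (`a?` repeated ''n'' times followed by `a` repeated ''n'' times, matched against ''n'' letters `a`) is the classic stress test for backtracking engines; the value of `n` and the variable names are arbitrary choices for illustration, and the timing printed will depend on the machine and Tcl version.

======
set n 20
# Build a?a?...a? followed by aa...a, and a subject string of n letters.
set re  [string repeat a? $n][string repeat a $n]
set str [string repeat a $n]

# The subject matches only if every optional a? matches the empty string.
set matched [regexp $re $str]
puts "matched: $matched"

# Report the average time per match over ten runs.
puts [time {regexp $re $str} 10]
======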


** Description **

[regexp] and [regsub] accept arguments that are Tcl regular expressions.

The following applies to Tcl 8.1 or newer.

Regular expression atoms and metacharacters fall into one of several classes.
The first type expresses that a specific character is to be matched.

   '''literal''':   Any alphabetic, numeric, or white space character is usually treated as a literal match.  However, there are a few cases, detailed below, where such characters are used in a metacharacter construct.

   `[`''characters''`]`:   means "match any of the characters in the brackets."  If the first character inside the brackets is a caret (`^`), the meaning is changed to "match any character not in the brackets".  A minus sign (`-`) between any two characters in the brackets means "all characters between these two."

   .:   matches any single character

   \k:   When '''k''' is non-alphanumeric, the atom matches the literal character '''k'''


   \c:   When '''c''' is alphanumeric (possibly followed by other characters), the sequence is called a [Regular Expression Escape Sequence]
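
A small sketch of the atoms listed above. The sample strings and variable names are made up for illustration; `regexp -inline` returns the matched text as a list instead of storing it in variables.

======
set literal [regexp -inline {cat}     "concatenate"]       ;# literal characters
set anyOf   [regexp -inline {[aeiou]} "rhythm and blues"]  ;# any one listed character
set noneOf  [regexp -inline {[^a-z]}  "abc7def"]           ;# negated range
set anyChar [regexp -inline {c.t}     "a cut above"]       ;# dot: any single character
set escaped [regexp -inline {3\.14}   "pi is 3.14"]        ;# \. is a literal dot
set digits  [regexp -inline {\d\d}    "room 42"]           ;# \d is an escape for a digit

puts [list $literal $anyOf $noneOf $anyChar $escaped $digits]
======

This prints `cat a 7 cut 3.14 42`.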

The second type are modifiers to regular expression atoms, expressing the
quantity of characters to be matched.  These are called quantified atoms.

   `*`:   The largest series (zero or more occurrences) of the '''preceding''' regular expression atom will be matched.
   `+`:   The largest series (one or more occurrences) of the '''preceding''' regular expression atom will be matched.
   `?`:   This is a boolean type quantifier - it means the atom may or may not appear (i.e. it may appear 0 or 1 times).
   `{m}`:   a sequence of exactly m matches of the atom
   `{m,}`:   a sequence of m or more matches of the atom
   `{m,n}`:   a sequence of no less than m and no more than n matches of the atom. [LV] In Tcl, this has an upper limit of 255.

   `*?`:   non-greedy form of * quantifier - if there is more than one match, selects the smallest of the matches
   `+?`:   non-greedy form of + quantifier
   `??`:   non-greedy form of ? quantifier
   `{m}?`:   non-greedy form of {m} quantifier
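
The quantifiers above can be compared side by side. This is only a sketch; the subject strings and variable names are invented for the example.

======
set s "aaa bbbbb"
set star    [regexp -inline {a*}       $s]   ;# aaa    zero or more, as many as possible
set plus    [regexp -inline {b+}       $s]   ;# bbbbb  one or more
set exact   [regexp -inline {b{2}}     $s]   ;# bb     exactly two
set atLeast [regexp -inline {b{2,}}    $s]   ;# bbbbb  two or more
set range   [regexp -inline {b{2,4}}   $s]   ;# bbbb   between two and four
set lazy    [regexp -inline {b{2,4}?}  $s]   ;# bb     non-greedy: as few as allowed
set lazyOne [regexp -inline {b+?}      $s]   ;# b      non-greedy form of +
set opt     [regexp -inline {colou?r}  "color colour"]   ;# color  the u is optional

puts [list $star $plus $exact $atLeast $range $lazy $lazyOne $opt]
======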

The third type of regular expression metacharacters modify a regular expression
atom by placing constraints upon the characters being matched.

   `^`:   Matches the beginning of a string.
   `$`:   Matches the end of the string.   While it is common to think of this character matching the newline, it actually just matches the end of the string, not some character at the end, and one cannot manipulate the newline by for instance trying to replace the symbol by a null string, etc.
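
A short sketch of the two anchors, using invented sample strings. It also shows that `$` marks a position rather than a character: by default it does not match before a trailing newline, and substituting for it appends text instead of replacing anything.

======
set atStart  [regexp {^foo} "foobar"]     ;# 1: foo is at the beginning
set notStart [regexp {^bar} "foobar"]     ;# 0: bar is not at the beginning
set atEnd    [regexp {bar$} "foobar"]     ;# 1: bar is at the end
set beforeNl [regexp {bar$} "foobar\n"]   ;# 0: a newline follows bar

# Substituting for $ inserts at the end of the string; nothing is removed.
regsub {$} "abc" "!" appended

puts [list $atStart $notStart $atEnd $beforeNl $appended]
======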

The fourth type of regular expression metacharacters are used for expressing
groupings.

   `(`''regular expression''`)`:   Parentheses surrounding one or more regular expressions specify a nested regular expression or choice of several regular expressions.   The substring matching the nested regular expression is captured and can be referred to via the ''back reference'' mechanism, and is also captured into the corresponding match variable specified as an argument to the command.

   `(?:`''regular expression''`)`:   Specifies that any match against the grouped ''regular expression'' will not be captured as a back reference or into a match variable.

   `()`:   This matches an empty string, returning the match in a variable.

   `(?:)`:   matches an empty string, but does not return the results in a match variable.

   ''regular expression''`|`''regular expression'':   The '''vertical bar''' ('''pipe''') character indicates a choice between the regular expressions on either side.  In other words, "match RE-a OR RE-b".  Each side of the symbol is called a ''branch''.  Each branch is an independent regular expression.  Empty branches match an empty string.
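
The grouping constructs above might be exercised as follows. The strings and variable names are illustrative only; `->` is just a conventional throwaway variable name for the overall match.

======
# Capturing groups fill the match variables in order.
regexp {(\d+)-(\d+)} "pages 10-25" whole first last

# A non-capturing group still groups the alternatives but uses no match variable.
regexp {(?:Mr|Ms)\. (\w+)} "Dear Ms. Jones" -> name

# A back reference (\1) matches the same text the first group captured.
set doubled [regexp -inline {(\w)\1} "balloon"]

# Alternation chooses between branches.
set pet [regexp -inline {cat|dog} "hot dog"]

puts [list $whole $first $last $name $doubled $pet]
======

Here `whole` is `10-25`, `first` is `10`, `last` is `25`, `name` is `Jones`, `doubled` is the list `ll l` (the overall match followed by the captured group), and `pet` is `dog`.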


** Other Users of Tcl Regular Expressions **

[BAS] : just a tidbit, the [Postgresql] DBMS uses Tcl's regexp engine for its
own regexp handling; see
[http://archives.postgresql.org/pgsql-announce/2003-02/msg00008.php].


** Greedy vs Non-greedy **

Regular expressions are either greedy or non-greedy, regardless of your mixture
of greedy/non-greedy metacharacters.  Refer to
[http://groups.google.com/d/msg/comp.lang.tcl/GzN9oXAbyK0/3xZpxLpI2dEJ%|%Regexp:
Matching pairs of characters] ,[comp.lang.tcl] ,2001-11-28 , and to
[http://groups.google.com/d/msg/comp.lang.tcl/FddeFPbTFw8/asoMuv7dWqIJ%|%tcl
8.2 regexp not doing non-greedy matching correctly] ,[comp.lang.tcl]
,1999-09-20


[MG]:  OK, I'm sure someone can do this better than me, but since nothing's
here at the moment I'll make a start...

By default, the regexp characters `+` and `*` match as much as possible (which
is called greedy matching). By placing a ''?'' after them, you can make them
match as little as possible (non-greedy). For example...

======
regexp a.+3 abc123abc123 var
set var
======

output:

======
abc123abc123
======

`.+` matched all the characters up until the last 3. In contrast:

======
regexp a.+?3 abc123abc123 var
puts $var
======

output:

======none
abc123
======

because `+?` matched as little as possible.

Greedy regexp matching is a particular problem in parsing HTML, etc, because...

======
set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
regexp <b>.*</b> $str var
puts $var
======

output:

======
<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>
======

`.*` matched as much as possible. It began matching at the first
occurrence of `<b>` and didn't quit until the ''last'' occurrence of `</b>`. But,
using a non-greedy regexp to match...

======
set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
regexp <b>.*?</b> $str var
puts $var
======

output:

======
<b>Some Bold Text</b>
======

Hope that explanation/rambling is some use, at least until someone with more
idea what they're doing puts something up :)

[AvL] I'll now mention a common pitfall with non-greedy REs: Let's go back to
the first example, but with a modified string:

======
regexp "a.+?3" "abc123ax3" var
set var
======

Although the second possible match `ax3` would be shorter, it will still find the
first match `abc123`, because even with non-greedy quantifiers, the first match
always wins.
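
To get at the later, shorter match one has to ask for it explicitly. A sketch using the same pattern and string, with the standard `-all`, `-inline` and `-start` switches of [regexp] (the variable names are arbitrary):

======
# The leftmost match wins, even with a non-greedy quantifier.
set first [regexp -inline {a.+?3} "abc123ax3"]

# -all -inline returns every non-overlapping match, left to right.
set all [regexp -all -inline {a.+?3} "abc123ax3"]

# -start skips ahead, so the search begins after the first match.
set later [regexp -start 6 -inline {a.+?3} "abc123ax3"]

puts [list $first $all $later]
======

`first` is `abc123`, `all` is the list `abc123 ax3`, and `later` is `ax3`.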


** Expanded Syntax **


[Henry Spencer] writes

======none
 >...You can't put extra spaces into regular
 >expressions to improve readability, you just have to suffer along
 >with the rest of us.
======

Actually, since 8.1 you can, although since it's one of ''57'' new features,
it's easy to miss.  Like so:

======
set re {(?x)
    \s+ ([[:graph:]]+)      # first number
    \s+ ([[:graph:]]+)      # second number
}
set data "     -1.2117632E+00     -5.6254282E-01"
regexp $re $data match matchX matchY
======

The initial `(?x)`, which must be right at the start, puts the regexp parser
into expanded mode, which ignores white space (with some specific exceptions)
and #-to-end-of-line comments.
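
One of the exceptions is white space inside a bracket expression, which is kept. A sketch with a made-up timestamp format shows how a literal space can still be matched in expanded mode; the pattern, the input string and the variable names are only for illustration.

======
set re {(?x)
    (\d{4}) - (\d{2}) - (\d{2})   # year, month, day
    [ ]                           # a literal space, retained because it is in brackets
    (\d{2}) : (\d{2})             # hours and minutes
}
regexp $re "logged at 2007-10-01 12:30" -> year month day hour minute
puts "$year/$month/$day $hour:$minute"
======

This prints `2007/10/01 12:30`.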


** Compiling Regular Expressions **

The first time a value is passed as a regular expression to [regexp] or
[regsub], it is compiled, and the compiled form is cached in the value's
internal structure.  To compile a regular expression ahead of time:

======
regexp $RE {}
======

[http://sourceforge.net/p/tcl/feature-requests/312/%|%Feature Request #312:
Function to Compile a Pattern] ,2003

[DKF]: I prefer to use

======
regexp -about $RE
======

to do the compilation, but that's probably a matter of style.
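
As a sketch of what `-about` gives back: it returns a list whose first element is the number of capturing subexpressions and whose second element is a list of property names, and it raises an error if the pattern does not compile. The pattern and variable names here are invented for the example.

======
set RE {(\w+)@(\w+)\.com}

# Compiles the pattern without matching anything and reports on it.
set about [regexp -about $RE]
puts "capturing groups: [lindex $about 0]"

# A malformed pattern is rejected at this point, before any matching is attempted.
set failed [catch {regexp -about {(unclosed}} msg]
puts "compile failed: $failed"
======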


** Comma Number Formatting **

Some folks insist on inserting commas (or other characters) to format digits
into groups of three.  Here is a regexp to do the trick from [Keith Vetter].
(Thanks Keith!)  The Perl manual describes a very slick method of doing this:

======none
1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
======

Translated into (pre 8.1) Tcl you get:

======
set n 123456789.00
while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
puts $n
======

results in

======none
123,456,789.00
======

You can tighten this up a little using Tcl 8.1's regular expressions:

======
while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}
======

Using the extended syntax, this becomes a bit easier to understand:

======
while {[regsub {(?x)
        ^([-+]?\d+)     # The number at the start of the string...
        (\d{3})         # ...has three digits at the end
} $n {\1,\2} n]} {
    # So we insert a comma there and repeat...
}
======
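
For reuse, the loop can be wrapped in a procedure. This is a minimal sketch: the name `commify` and the optional `sep` argument are hypothetical, and the separator is assumed not to contain `&` or a backslash, which [regsub] would interpret specially in the substitution string.

======
proc commify {n {sep ,}} {
    # Keep inserting a separator before the last three digits of the
    # leading run of digits until no more substitutions are made.
    while {[regsub {^([-+]?\d+)(\d{3})} $n "\\1$sep\\2" n]} {}
    return $n
}

puts [commify 123456789.00]
puts [commify -1234567 _]
puts [commify 999]
======

The three calls print `123,456,789.00`, `-1_234_567` and `999`.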

[RS]: For a version with configurable separator, see [Bag of algorithms], item
"Number commified"

[Ro]: See also [Human readable file size formatting] for a version without regular expressions for those of us who are allergic to monstrous complexity ;)


** Misc **


The above discussion needs to cover the advanced regular expression syntax as
handled by Tcl 8.1 and later and show the user what all the differences are, so
that one can write portable code when necessary - or at least create
appropriate package require statements.

----

[KBK] has astutely remarked that, "Much of the art of designing recognizers is
the art of controlling such things in the common cases; regexp matching
in general is PSPACE-complete, and our extended regexps are even worse (... not
bounded by any tower of exponentials ...)."
[http://sourceforge.net/mailarchive/forum.php?thread_id=29946064&forum_id=6718]

[Lars H] 2008-06-01: Somehow I doubt [KBK] would say that, in part because it's
'''dead wrong''' as far as basic [regular expression]s are concerned — given a
regular expression of size ''m'' and a string of size ''n'' it is always
possible to test whether the string matches that regular expression in time
that is linear in ''n'' and polynomial in ''m''. Googling for "regexp matching
PSPACE complete" turns up this page, but otherwise rather suggests that ''other
problems'' concerning regular expressions, in particular deciding whether two
regular expressions are equivalent, may be PSPACE-complete. (Which is actually
kind of interesting, since the naive determinization algorithm for this might
need exponential amounts of memory and thus not be in PSPACE at all, but
off-topic.)

The link provided as source currently doesn't work (no surprise, it's into
[SourceForge] mail archives), but the forum_id seems to refer to development of
a [Perl] module (text::similarity, in some capitalization) rather than anything
Tcl related. That matching using Perl's so-called "regexps" should be "worse
than PSPACE-complete" is something I can believe, so in that context the quote
makes sense, but why it should then be attributed to [KBK], and moreover why it
should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is
still a mystery.

[LV]:  Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core
(well, actually from what I can see, one can only search ALL of activestate's
mailing list archives), doesn't turn up a reference like this. Maybe it is
quite old - before activestate?

[TV]: Auw, man... That's like suggesting something like the [traveling salesman
problem] is only there to upset people that a certain repository will have the
perfect solution for this type of problem, but like that the actual world's
best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer
tapes from some university in 1984 or so, the whole of "datastructures and
algorithms" will end up like the 'English IT Show' on the Comedy Channel, and
then on the [Who says Tcl sucks...] graveyard like the [connection machine] was
great but forgotten and the world's greatest synthesizer developers/researchers
are in "The Dead Presidents Society" (CEO's that is, like 'the dead Poets
Society').


<<categories>> Arts and crafts of Tcl-Tk programming | Tutorial | String Processing