Purpose: Describe Tcl Regular Expressions, emphasising advanced features.

----
See http://www.tcl.tk/doc/howto/regexp81.html - ''New Regular Expression Features in Tcl 8.1''
----

A '''regular expression''' is a way of describing a ''pattern'' that you are seeking in a string (used mainly by Tcl's [regexp] and [regsub] commands).
One of the best resources for understanding regular expressions is the [O'Reilly] [BOOK Mastering Regular Expressions], which talks about many of the uses of regular expressions, 
from the Unix '''grep'''(1) command to Tcl and beyond.

In the following examples, regular expressions and strings will be listed inside {} .

An analogy that might make regular expressions easier is to think of them in a ''chemistry'' sense.
One starts with '''atoms''' - the smallest building blocks of a regular expression.

A regular expression atom is made of either a '''literal character''' or a '''metacharacter'''.

A literal character is the simplest regular expression possible.  
For example, the string {a} is a one character regular expression.  
It can be used to match a portion of any string which contains the letter a.  
Compare the regular expression {a} against the string "abc" and you get a match.  
Compare {a} against the string "xyz" and you do not get a match.

A metacharacter is a way of telling the regular expression engine what you want in more general terms.  
For instance, the metacharacter {.} means 'match any character'.  
Compare a period {.} against any string one character or longer and you get a match.

Another metacharacter is the {\\}.  The backslash metacharacter tells the regular expression 
that the following character is to be used literally. 
This is most helpful when describing patterns that themselves contain metacharacters.
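
The three kinds of atom described so far can be tried directly with [regexp], which returns 1 on a match and 0 otherwise. This is only a small illustration; the sample strings are made up for the purpose.

```tcl
# Literal atom: {a} matches anywhere in the string.
puts [regexp {a} abc]      ;# 1
puts [regexp {a} xyz]      ;# 0

# {.} matches any single character, so any non-empty string matches.
puts [regexp {.} xyz]      ;# 1
puts [regexp {.} ""]       ;# 0

# {\.} is a backslash-quoted period: it matches only a literal period.
puts [regexp {\.} xyz]     ;# 0
puts [regexp {\.} x.z]     ;# 1
```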

The following applies to Tcl 8.1 or newer.

Regular expression atoms and metacharacters fall into one of several classes. 
The first type expresses that a specific character is to be matched.

   literal:   Any alphabetic, numeric, or white space character is usually treated as a literal match.  However, there are a few cases, detailed below, where such characters are used in a metacharacter construct.

   [[characters]]:   The notation here defines a set of characters, any one of which may match.  An exclusive match is one in which the first character inside the brackets is the caret (^) character.  
If the caret is the first character in the set, then the following characters define a set of characters which must ''not'' match.  
Otherwise, without the leading caret, the characters define a set from which one character must match.  
The set is given as individual characters or as ranges of characters separated by a minus sign (-).

   .:   A period matches any single character.

   \k:   When '''k''' is non-alphanumeric, the atom matches the literal character '''k'''


   \c:   When '''c''' is alphanumeric (possibly followed by other characters), the sequence is called a [Regular Expression Escape Sequence]
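
The atoms listed above can be sketched as follows. The sample strings are invented, and '''regexp -inline''' is used only so that the matched text can be printed.

```tcl
# [aeiou] matches any one of the listed characters.
puts [lindex [regexp -inline {[aeiou]} "tcl wiki"] 0]      ;# i

# A leading caret negates the set: a run of non-digits.
puts [lindex [regexp -inline {[^0-9]+} "2024-tcl"] 0]      ;# -tcl

# Ranges are written with a minus sign.
puts [lindex [regexp -inline {[0-9a-f]+} "x=beef42"] 0]    ;# beef42

# \d is an escape sequence (backslash plus alphanumeric): any digit.
puts [lindex [regexp -inline {\d+} "port 8080"] 0]         ;# 8080

# \$ is a backslash plus non-alphanumeric: a literal dollar sign.
puts [lindex [regexp -inline {\$\d+} {cost: $15}] 0]       ;# $15
```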

The second type are modifiers to regular expression atoms, expressing how many times the atom is to be matched.  These are called quantified atoms.

   *:   The largest series (zero or more occurrences) of the '''preceding''' regular expression atom will be matched.
   +:   The largest series (one or more occurrences) of the '''preceding''' regular expression atom will be matched.
   ?:   This is a boolean type quantifier - it means the atom may or may not appear (i.e. it may appear 0 or 1 times).
   {m}:   a sequence of exactly m matches of the atom
   {m,}:   a sequence of m or more matches of the atom
   {m,n}:   a sequence of no less than m and no more than n matches of the atom

   *?:   non-greedy form of * quantifier - if there is more than one match, selects the smallest of the matches
   +?:   non-greedy form of + quantifier
   ??:   non-greedy form of ? quantifier
   {m}?:   non-greedy form of {m} quantifier
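
A quick way to see the difference between the quantifiers is to print what each one actually matches. The helper '''first''' below is a hypothetical convenience proc, not part of Tcl; it returns the text of the overall match, or an empty string when nothing matches.

```tcl
# Hypothetical helper: return the text matched by re in str.
proc first {re str} {
    return [lindex [regexp -inline -- $re $str] 0]
}

puts "[first {a*} aaa]"        ;# aaa  (greedy: as many as possible)
puts "[first {a+} aaa]"        ;# aaa
puts "[first {a?} aaa]"        ;# a    (zero or one)
puts "[first {a{2}} aaa]"      ;# aa   (exactly two)
puts "[first {a{2,}} aaa]"     ;# aaa  (two or more)
puts "[first {a{1,2}} aaa]"    ;# aa   (between one and two)

puts "[first {a+?} aaa]"       ;# a    (non-greedy: as few as possible)
puts "[first {a*?} aaa]"       ;# empty string: zero occurrences suffice

# The classic case: greedy runs to the last >, non-greedy stops at the first.
puts "[first {<.+>} <b>bold</b>]"     ;# <b>bold</b>
puts "[first {<.+?>} <b>bold</b>]"    ;# <b>
```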

The third type of regular expression metacharacters modify a regular expression atom by placing constraints upon the characters being matched.

   ^:   The '''following''' regular expression will only match when it occurs at the beginning of a string.
   $:   The '''preceding''' regular expression will only match when it occurs at the ''end'' of a string.  Although it is common to think of this character as matching the newline, it matches a position rather than a character, so one cannot manipulate the newline by, for instance, replacing the symbol with an empty string.
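
The two constraints can be illustrated as below. The sample strings are made up; the '''-line''' switch of [regexp] is shown only to contrast the default behaviour of {$} with newline-sensitive matching.

```tcl
# ^ anchors the match to the start of the string.
puts [regexp {^abc} "abcdef"]          ;# 1
puts [regexp {^abc} "xabc"]            ;# 0

# $ anchors the match to the end of the string.
puts [regexp {abc$} "xabc"]            ;# 1
puts [regexp {abc$} "abc\ndef"]        ;# 0: the newline is not the end of the string

# With -line, $ also matches just before a newline.
puts [regexp -line {abc$} "abc\ndef"]  ;# 1

# $ matches a position, not a character: substituting for it
# adds text at the end and removes nothing.
regsub {$} "abc" "!" out
puts $out                              ;# abc!
```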

The fourth type of regular expression metacharacters are used for expressing groupings.

(.RE-a.) --   Parens surrounding a series of regular expression atoms (with possible modifiers) are treated as an entity to be matched.  The results are returned, if the invocation provides for matches to be returned in a variable.

(?:.RE-a.) --  This modified version of grouping treats the enclosed regular expression atoms as an entity, but does not return the results in a match variable.

() --   This matches an empty string, returning the match in a variable.

(?:) --   This matches an empty string, but does not return the results in a match variable.

.RE-a.|.RE-b. --   The vertical bar/or symbol/pipe symbol is a metacharacter used to separate regular expression atoms.  The resulting regular expression means "match RE-a OR RE-b".  
Each side of the symbol is called a ''branch''.  
Each branch consists of zero or more constraints or quantified atoms, concatenated.  
Empty branches match an empty string.
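
A short sketch of the grouping forms, using an invented date string. The variable names ('''all''', '''y''', '''m''', '''d''') are arbitrary.

```tcl
set s "2024-06-15"

# Capturing groups: each parenthesised part is returned in its own variable.
regexp {(\d+)-(\d+)-(\d+)} $s all y m d
puts "$all / $y / $m / $d"      ;# 2024-06-15 / 2024 / 06 / 15

# (?:...) groups without capturing, so only one submatch is reported.
puts [regexp -inline {(?:\d+)-(\d+)} $s]    ;# 2024-06 06

# Alternation: either branch may match.
puts [lindex [regexp -inline {cat|dog} "hot dog"] 0]    ;# dog

# An empty branch matches the empty string.
puts [regexp {^(a|)$} ""]       ;# 1
```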


----
One must be aware that a regular expression as a whole is either greedy or non-greedy, regardless of your mixture of greedy/non-greedy metacharacters.  Refer to this [http://groups.google.com/groups?th=cda7ef577e79b545] comp.lang.tcl thread, and specifically this [http://groups.google.com/groups?hl=en&selm=FIECG4.F75%40spsystems.net] Sept. 1999 posting from Henry Spencer to c.l.t.
----

'''Comma Number Formatting'''

Some folks insist on inserting commas (or other characters) to format digits into groups of three.  
Here is a regexp to do the trick from [Keith Vetter].  (Thanks Keith!)  
The Perl manual describes a very slick method of doing this:

     1 while s/^([-+]?\d+)(\d{3})/$1,$2/;

Translated into (pre 8.1) Tcl you get:

    set n 123456789.00
    while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
    puts $n

results in

    123,456,789.00

(You can tighten this up a little using Tcl 8.1's regular expressions:

    while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}

Using the extended syntax, this becomes a bit easier to understand:

    while {[regsub {(?x)
        ^([-+]?\d+)     # The number at the start of the string...
        (\d{3})         # ...has three digits at the end
    } $n {\1,\2} n]} {
                        # So we insert a comma there and repeat...
    }

)

For a version with configurable separator, see [Bag of algorithms],
item "Number commified" - ''RS''

----

[Henry Spencer] writes

 >...You can't put extra spaces into regular
 >expressions to improve readability, you just have to suffer along
 >with the rest of us.

Actually, since 8.1 you can, although since it's one of ''57'' new features,
it's easy to miss.  Like so:

 set re {(?x)
        \s+ ([[:graph:]]+)      # first number
        \s+ ([[:graph:]]+)      # second number
 }
 set data "     -1.2117632E+00     -5.6254282E-01"
 regexp $re $data match matchX matchY

The initial "(?x)" (which must be right at the start) puts the regexp
parser into expanded mode, which ignores white space (with some specific
exceptions) and #-to-end-of-line comments.
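
The "specific exceptions" matter as soon as the pattern itself has to match a space. As a small sketch (sample strings invented): in expanded mode, white space inside a bracket expression is kept, and an escape such as \s still works.

```tcl
# In expanded mode the blank between foo and bar is ignored,
# so the pattern is really {foobar}.
puts [regexp {(?x) foo bar} "foobar"]        ;# 1
puts [regexp {(?x) foo bar} "foo bar"]       ;# 0

# White space inside a bracket expression is retained.
puts [regexp {(?x) foo [ ] bar} "foo bar"]   ;# 1

# The \s escape matches white space as usual.
puts [regexp {(?x) foo \s bar} "foo bar"]    ;# 1
```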

----
More information is available in the Tcl manual page on regular expressions.
You can view two of the pages at:

http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/re_syntax.htm and
http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/regexp.htm .

Chapter 11 in Brent Welch's book also covers regular expressions.
Information on the book can be found at
http://www.beedub.com/book/3rd .  An older version of the chapter,
which won't cover the most recent developments, can be found at
http://www.beedub.com/book/2nd/regexp.doc.html#236 .

----
The above discussion needs to cover the advanced regular expression syntax as handled by Tcl 8.1 and later and show the user what all the differences are, so that one can write portable code when necessary - or at least create appropriate package require statements.

----

Another useful place to learn about Regular Expressions is the page at the Tcl Developer's Xchange at
http://www.purl.org/tcl/home/doc/howto/regexp81.html , where the
Tcl 8.x specific features are discussed.


----
Some algorithms are easier coded ''without'' REs.  Tcl's [string] command is 
versatile, and often simplifies problems many programmers hit with the RE hammer.
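
A few common tasks, sketched with [string] subcommands instead of regular expressions. The header line and the other sample values are made up for illustration.

```tcl
set line "Content-Type: text/html"

# Prefix test with a glob pattern instead of {^Content-Type:}.
puts [string match "Content-Type:*" $line]          ;# 1

# Split at the first colon without a capturing RE.
set idx [string first ":" $line]
set name  [string range $line 0 [expr {$idx - 1}]]
set value [string trim [string range $line [expr {$idx + 1}] end]]
puts "$name = $value"                               ;# Content-Type = text/html

# Validate an integer without {^[-+]?\d+$}.
puts [string is integer -strict "42"]               ;# 1
puts [string is integer -strict "4x2"]              ;# 0

# Delete every comma without regsub -all.
puts [string map {, ""} "1,234,567"]                ;# 1234567
```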


----
[[Explain [Komodo] RE debugger.]]

----
[tkWorld] contains tkREM which is a regular expression maker.
Perhaps someone familiar with it would like to discuss it.

[^txt2regex$] http://txt2regex.sourceforge.net
is a Regular expression wizard written in bash2 that converts human sentences into regular expressions.  
It can be used to build up regular expressions suitable for use in Tcl.

Visual REGEXP http://laurent.riesterer.free.fr/regexp/ is software to help you debug regular expressions.

See [redet] for another tool to assist in developing regular expressions.
----
If someone is still '''stuck''' using Tcl 8.0.x, you might take a look at ftp://ftp.procplace.com/pub/tcl/sorted/packages-7.6/devel/nre30.tar.gz which is one of a couple extensions back then that provided a superset of regular expression functionality.  
Unfortunately, this does not provide all the power of Tcl 8.1 and newer, 
but at least it is more than was available before 8.1.

tcLex [http://www.multimania.com/fbonnet/Tcl/tcLex/index.en.htm] is a lexical analyzer which uses Tcl regular expressions to do the matching.

[Yeti] is another lexical analyser, parser generator.


----
Tcl's regular expression engine is an interesting and subtle object
for study in its own regard.  While [Perl] is the language that
deserves its close identification with RE capabilities, Tcl's engine
competes well with it and every other one.  In fact, although he doesn't 
favor Tcl as a language, RE expert [Jeffrey Friedl] has written
[http://www.oreillynet.com/pub/a/network/2002/07/15/regexp.html]
that 
"Tcl's [[RE]] engine is a hybrid with the best of both worlds."

For more on different engines, see Henry's comments in 
[http://groups.google.com/groups?selm=FIECG4.F75%40spsystems.net].
----
Yet another meaning of "Regular Expressions":  the name of an
at-least-monthly column on scripting languages [CL] has co-authored since 1998
[http://www.regularexpressions.com/].


----

A Tcl value can be marked internally as holding a compiled regular expression, so that reusing the same value avoids recompiling it.
REs can be pre-compiled by the call "regexp $RE {}" [http://sourceforge.net/tracker/?group_id=10894&atid=360894&func=detail&aid=832230].

[DKF]: I prefer to use '''regexp -about $RE''' to do the compilation, but that's probably a matter of style.
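
Both styles of pre-compilation can be sketched as follows. The pattern is arbitrary. Note that '''-about''' is not described in the regexp manual page; the assumption here, based on the remark above and on the Tcl sources, is that it compiles the expression and returns a list whose first element is the number of capturing subexpressions.

```tcl
set RE {(\d+)-(\d+)}

# Style 1: match against an empty string purely for the side effect
# of compiling $RE.  The match itself fails, so the result is 0.
puts [regexp $RE {}]                 ;# 0

# Style 2: ask for information about the expression, which also compiles it.
set info [regexp -about $RE]
puts [lindex $info 0]                ;# 2 (assumed: count of capturing groups)

# Later matches should use the same variable, so that the value
# carrying the compiled form is the one being reused.
puts [regexp $RE "10-20" all a b]    ;# 1
puts "$a $b"                         ;# 10 20
```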

----
[KBK] has astutely remarked that, "Much of the art of designing recognizers is
the art of controlling such things in the common cases; regexp matching
in general is PSPACE-complete, and our extended regexps are even worse
(... not bounded by any tower of exponentials ...)."
[http://sourceforge.net/mailarchive/forum.php?thread_id=29946064&forum_id=6718]

----

See also:

   * [Regular Expression Examples] 
   * [Beginning Regular Expressions]
   * [Regular Expression Debugging Tips]
   * [re_syntax]
   * [Drawbacks of Tcl's Regexps]
   * a thorough tutorial with examples [http://www.regular-expressions.info/tutorial.html] (although its imprecisions exasperate [CL])
   * an extensive library of regular expressions for various tasks [http://regular-expressions.info/examples.html]
   * an article on five habits for regular expression development [http://www.onlamp.com/pub/a/onlamp/2003/08/21/regexp.html]

----
[[Refs to [Henry Spencer] and Kleene [http://www.library.wisc.edu/libraries/Math/kleene.htm].]]
----
[Category Tutorial] -
[Arts and crafts of Tcl-Tk programming] -
[Category String Processing]