Regular Expressions

Difference between version 105 and 106 - Previous - Next
Tcl '''[regular expression%|%regular expressions]''' or [ARE%|%Advanced Regular
Expressions], implemented by [Henry Spencer], are described below. 



** Disambiguation **

"Regular Expressions" is the name of an at-least-monthly column on scripting
languages [CL] has co-authored from 1998 to 2009.  See
[http://www.regularexpressions.com/%|%Cameron Laird's Personal Notes on Regular
Expressions].



** Documentation **

   [re_syntax]:   The official reference manual for Tcl regular expression syntax.



** Tcl Commands **

   [regexp]:   

   [regsub]:   



** See Also **

   [http://tclstuff.ucoz.net/regex.html%|%Regex]:   New regular expression engine by [Stefan K] made to match [UTF-8] strings directly.

   [regmap], by [SS]:   Applies scripts to matching substrings.

   [regular expression]:   About regular expressions in general, including the theory behind them and features not found in Tcl.


   [http://www.tcl.tk/doc/howto/regexp81.html%|%New Regular Expression Features in Tcl 8.1]:   

   [Beginning Regular Expressions]:   

   [Regular Expression Examples]:   

   [scan]:   A routine that can be used surprisingly often instead of using a regular expression.  For example, to extract the first natural decimal number from a string:  `scan abc123 {%[[^0-9]]%lld}`

   [Advanced Regular Expression Examples]:   

   [Regular Expression Debugging Tips]:   
   
   [https://groups.google.com/d/msg/comp.compilers/8oYjzrXhIEk/iZ3h0krmv0gJ%|%Henry Spencer's "Tcl" Regex Library], comp.compilers, 2007-10-01:   

   [http://www.arglist.com/regex%|%regex - Henry Spencer's regular expression libraries]:   Links to Henry Spencer's original release in 1994, "regex3.8a.tar.gz", that was included in 4.4 BSD Unix, Walter Waldo's port of Henry's Tcl regex in Tcl 8.1, and Thomas Lackener's port from Tcl-8.5a3.

   [http://beedub.com/book/3rd/regexp.pdf%|%Chapter 11: Regular Expressions], [Book Practical Programming in Tcl and Tk], 3rd Edition, by [Brent Welch]:   

   [http://web.archive.org/web/20080609172251/http://www.unixreview.com/documents/s=10121/ur0702e/%|%Tcl Scores High in RE Performance], [Cameron Laird] and Kathryn Soraiz, 2007-02:   

   [Drawbacks of Tcl's Regexps]:   
   ][Regular Expressions Match Requirements]:   A set of expressions that describe requirements for an improved regular expression matcher.

   [http://www.reddit.com/r/programming/comments/68280/tcl_regex_implementation_beats_whole_competition/%|%TCL regex implementation beats whole competition. Even C with PCRE and Boost regex, (shootout.alioth.debian.org)], reddit.com, 2008-02-07:   

   [http://perldoc.perl.org/perlreguts.html#Unicode-and-Localisation-Support%|%Unicode and Localisation Support]:   Illustrates that [Perl] regular expressions, being derived from [Henry Spencer%|%Spencer's] earlier regular expression attempts, suffer [Unicode] issues that Tcl doesn't.

   [http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html%|%Regular Expressions: Now You Have Two Problems], Jeff Atwood, 2008-06-27:   Some pulpit-thumping about regular expressions.



** Resources **

   [Komodo]:   Provides a regular expression debugger.

   [Visual REGEXP]:   A visual regular expression development and test utility.

   [tkWorld]:   Provides tkREM, a regular expression maker.

   [http://txt2regex.sourceforge.net%|%^txt2regex$]:   A Regular expression wizard written in [bash] that converts human sentences into regular expressions.  It can be used to build up regular expressions suitable for use in Tcl.

   [redet]:   A tool to assist in developing regular expressions.

   [ftp://ftp.tcl.tk/pub/tcl/all/n/nre/3.0/nre30.tar.gz%|%nre30.tar.gz]:   One of a couple extensions prior to Tcl 8.1 that that provided a superset of regular expression functionality.  This does not provide all the power of Tcl 8.1 and newer, but at least it is more than was available before 8.1.

   [tcLex]:   A lexical analyzer which uses Tcl regular expressions to do the matching.

   [Yeti] and [Ylex]:   Another lexical analyser and parser generator.



** Design **

[Jeffrey E. F. Friedl%|%Jeffrey Friedl]
[http://www.oreillynet.com/pub/a/network/2002/07/15/regexp.html%|%has written]
that "Tcl's [[RE] engine is a hybrid with the best of both worlds."

See [Henry Spencer]'s reply in
[http://groups.google.com/d/msg/comp.lang.tcl/FddeFPbTFw8/asoMuv7dWqIJ%|%tcl
8.2 regexp not doing non-greedy matching correctly], [comp.lang.tcl], 1999-09-20.

Most common regular expression implementations (notably [Perl] and direct
derivatives of the [PCRE] library) exhibit poor performance in certain
pathological cases.  Henry Spencer's complete reimplementation as a "hybrid"
engine appears to address some of those problems.  See
[http://swtch.com/~rsc/regexp/regexp1.html%|%Regular Expression Matching Can be
Simple and Fast (but is slow in Java, Perl PHP, Python, Ruby,...)], Russ Cox, 2007-01, for some fascinating benchmarks.

[Lars H]: A very nice paper! Highly recommended for anyone interested in the
internals of regular expression engines, and a good introduction to the theory.



** Description **

The regular expressions accepted by `[regexp]` and `[regsub]` are called
"Advanced Regular Expressions'', or '''ARE''''s, and are slightly different
from the '''[regular expression%|%extended regular expressions]''''s defined by
[POSIX].

An '''expression''' is composed of branches, and a '''branch''' is composed of
atoms.  Each type of '''atom''' is described below.

When there is more than one branch in an expression, the branch the results in
the longest match is selected.



** Atoms **

An atom is an indivisible component in an expression, which in turn is composed
of zero or more atoms.  The most simple pattern is a character that matches the
same character within the text.  Other atoms have special interpretations.  For
example, `.` matches a single character at the current position, whatever that
character is, and  `\n` matches one newline character.  `(hello)` is an ateom
that encapsulates an entire expression which matches he word, "hello", and also
captures the matching text for later reuse.

The available atoms are:

   ''x'':   A character that has no other interpretation matches itself. 

   `.`:   Matches any character.

   `\`''k'', where ''k'' is a single non-alphanumeric character:   Matches the literal character ''k''.


   `\`''c'', where  ''c'' is one of the enumerated escape sequences:   Matches what it is defined to match in [Regular Expression Escape Sequence].

   `[[`''characters''`]]`:   Matches any character in the set of ''characters''.  If the first character in ''characters'' is `^` (caret), then only characters not in ''characters'' match.  `x-y` specifies a range of characters, where ''x'' is the character at the lowest bound, and ''y'' is the character at the highest bound.  In contrast with other regular expression engines, when `\` occurs in ''characters'', it does not lose its special meaning.

   `()`:   Matches an empty string, capturing it.

   `(?:)`:   Matches an empty string without capturing.

   `(`''expression''`)`:   Matches the expression between the parentheses.  Quantifiers applied to the parenthesized expression apply to the entire expression.  The match is captured and can be referred to via the ''back reference'' mechanism.

   `(?:`''expression''`)`:   Like `(`''expression''`)`, but the match is not captured.

   ''`exression`''`|`''`expression`'' ''`...`'':   `|` delimits branches of atoms, only one of which is chosen as the matching pattern.  In choosing a branch, the longest match is preferred.  An empty branch matches the [empty string].



** Quantifiers **

A quantifier specifies how many times the preceding atom should match.  A
greedy quantifier matches as many times as possible and a thrifty quantifier
matches as few as possible.  A quantifer without an explicit preference uses
the preference of its branch.  If the branch has no preference, a quantifier
sets the preference of the branch.  Once the greediness for a branch is set, it
does not change.

Quantifiers that may make their branch greedy:


   `*`:   Matches zero or more times.

   `+`:   Matches a one or more times.

   `?`:   Matches zero or one times.

   `{m}`:   Matches exactly ''m'' times.

   `{m,}`:   Matches at least ''m'' times.

   `{m,n}`:   Matches at least ''m'' and at most ''n'' times. ''n'' currently may not be greater than 255.


Quantifiers that may make their branch thrifty, and that are thrifty in any case:

   `*?`:   Matches zero or more times.

   `+?`:   Matches one or more times.

   `??`:   Matches zero or one times.

   `{m}?`:   Matches exactly ''m'' times.  Identical to `{m}`, except for the possible side effect of setting greediness of an branch.


'''Examples''':

A greedy `.+`:

======
% regexp a.+3 abc123abc123 var
1
% set var
abc123abc123
======

A non-greedy `.+?`:

======
% regexp a.+?3 abc123abc123 var
1
% puts $var
abc123
======

In the following example `.*` matches as much as possible, which is everything from the first 
`<b>` to the ''last'' `</b>`, inclusive.

======
% set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
% regexp <b>.*</b> $str var
% set var
<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>

======

The nongreedy variant matches from first `<b>` to first subsequent `</b>`, inclusive:


======
% set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
% regexp <b>.*?</b> $str var
% set var
<b>Some Bold Text</b>
======


Since the greediness of a branch is configured by the first thing in it that
has any greediness, placing a "[no-op]" atom like `(:?)*?` at the beginning of
a branch ensures that it is non-greedy.  In the following example, `.*`
configures the branch to be non-greedy:

======
regexp -nocase -- {(:?)*?<b>.*</b>} $str var
======




''' See Also '''

   [Henry Spencer]'s reply in [http://groups.google.com/d/msg/comp.lang.tcl/FddeFPbTFw8/asoMuv7dWqIJ%|%tcl 8.2 regexp not doing non-greedy matching correctly], [comp.lang.tcl], 1999-09-20.:   

   [http://www.qcode.co.uk/tcl-regular-expressions-greedy-or-non-greedy/index.html%|%Tcl Regular Expressions – Greedy or Non-greedy?], David Osborne, 2013-03-22:   a case study in greediness, and how to select greedy or non-greedy for any expression.

   [http://groups.google.com/d/msg/comp.lang.tcl/GzN9oXAbyK0/3xZpxLpI2dEJ%|%Regexp: Matching pairs of characters], [comp.lang.tcl], 2001-11-28:   



** Wielding Greediness **

One trick with non-greedy quantifiers is to anchor the expression to the
beginning/end of the string, which has the effect of '''stretching''' the
non-greedy match:

======
%regexp -inline (.*?)(n+)(.*) ennui
en e n {} 
%regexp -inline ^(.*?)(n+)(.*)$ ennui
ennui e nn ui
======



** Constraints **

A constraint specifies how an atom may match within a string.

   `^`:   Matches the beginning of a string.

   `$`:   Matches the end of the string.   While it is common to think of this character matching the newline, it actually just matches the end of the string, not some character at the end, and one cannot manipulate the newline by, for instance, trying to replace the symbol by the [empty string], etc.

   `(?=`''`re`''`)`:   Positive lookahead (AREs only), matches at any point where a substring matching re begins.

   `(?!`''`re`''`)`:   Negative lookahead (AREs only), matches at any point where no substring matching re begins.

The lookahead constraints may not contain back references (see later), and all parentheses within them are considered non-capturing.

An RE may not end with `\`.

----

[AMG]: I wish Tcl had [Vim]-like `\zs` and `\ze` to specify where the match officially starts and ends within a regular expression.  To demonstrate their use, first here's how `\ze` works to implement positive lookahead:

`hello(?= world)` = `hello\ze world`

But what if we wanted to match "world" only when preceded by "hello"?  How can that be done in Tcl regular expressions?  In Vim, it's dead simple:

`hello \zsworld`

In the general case, `\zs` and `\ze` allow the regular expression to specify the leading and/or trailing contexts under which it will match.  This is simpler and more flexible than matching the whole thing (context and all), then postprocessing the result to pick out only the desired portion.  The reason is the leading and trailing contexts of one match may overlap any part of another match.  This gets tricky because the leading contexts of two matches may overlap, yet two match bodies may not overlap.

Here's a more general example that matches every instance of "x" that is both preceded and followed by "x":

`x\zsx\zex`

Given the input "xxxxyxxx", it would match at zero-based indexes 1, 2, and 6:

======none
xxxxyxxx
.12...6.
======



** Metasyntax **

A sequence having the form `(?`metachars`)`, directs the operation of the regular expression system.  For example, `(?i)` makes all further matching insensitive to character case.



** Other Users of Tcl Regular Expressions **

[BAS] : just a tidbit, [Postgresql] uses Tcl's regexp engine for its
own regexp handling; see [http://archives.postgresql.org/pgsql-announce/2003-02/msg00008.php%|%PostgreSQL Weekly News - Feb 12th 2003].



** [gotcha]:   The Earliest Match Wins **

[AvL]:  I'll now mention some common pitfall with non-greedy REs: Let's go back to
the first example, but with a modified string:

======
% regexp a.+?3 abc123ax3 var
% set var
abc123
======

Although second possible match `ax3` would be shorter, `abc123` is selected because even with non-greedy quantifiers, the first match always wins.

[PYK] 2019-08-15:  Yes, greediness of an atom, sub-expression or branch does not change the rule that the earliest match is preferred.



** [gotcha]: Interaction Between Quantifiers with Different Greediness  **

The greediness of a branch applies to all quantifiers in the branch. In the
following example, one might expect `$last` to contain the rest of the string,
but it doesn't:

======
% regexp {([^\s]*?)(\s+)(.*)} {a large heart and a large head} full first delim last]}
{a } a { } {}
======

`([^\s]*?)` matched the first `a`, `(\s+)` matched the first whitespace
character, and `(.*)` matched nothing.  Why?  The branch is non-greedy because
the first thing with greediness, `([^\s]*?`, is non-greedy, and therefore
everything else in the branch is non-greedy as well.



** Expanded Syntax **


[Henry Spencer] writes

======none
 >...You can't put extra spaces into regular
 >expressions to improve readability, you just have to suffer along
 >with the rest of us.
======

Actually, since 8.1 you can, although since it's one of ''57'' new features,
it's easy to miss.  Like so:

======
set re {(?x)
    \s+ ([[:graph:]]+)      # first number
    \s+ ([[:graph:]]+)      # second number
}
set data "     -1.2117632E+00     -5.6254282E-01"
regexp $re $data match matchX matchY
======

The initial `(?x)`, which must be right at the start, puts the regexp parser
into expanded mode, which ignores white space (with some specific exceptions)
and `#`-to-end-of-line comments.



** Compiling Regular Expressions **

`[regexp]` and `[regsub]` store a compiled representation of the expression as
its [Tcl_Obj%|%internal representation].  In subsequent calls to `[regsub]` or
`[regexp]`, this internal representation is used if available.


See also [http://sourceforge.net/p/tcl/feature-requests/312/%|%Feature Request #312:
Function to Compile a Pattern], 2003

It isn't really necessary to "precompile" a regular expression, but this would
have a similar effect:

======
regexp -about $RE
======

[AMG]: The only purpose I see for this is to detect syntax errors early rather than wait until the regular expression is used.



** Comma Number Formatting **

Some folks insist on inserting commas (or other characters) to format digits
into groups of three.  According to [Keith Vetter], the Perl manual describes
the following method:

======none
1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
======

This is a Tcl pre-8.1 translation:

======
set n 123456789.00
while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
puts $n
======

results in

======none
123,456,789.00
======

escape characters and quantifiers tighten this up a bit:

======
while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}
======

Using the extended syntax makes it a little easier to read:

======
while {[regsub {(?x)
        ^([-+]?\d+)     # The number at the start of the string...
        (\d{3})         # ...has three digits at the end
} $n {\1,\2} n]} {
                                        # So we insert a comma there and repeat...
}
======

[RS]: For a version with configurable separator, see [Bag of algorithms], item
"Number commified"

[Ro]: See also [Human readable file size formatting] for a version without
regular expressions for those of us who are allergic to monstrous complexity ;)



** Excluding Patterns from a Match **


[NEM]: A question in the [Tcl Chatroom] brought up a common problem that I've had
when dealing with regular expressions. `[[^AB]` means
"not A or B", but what if you want to match anything but the string `AB`? The
only way to do it is to put lots of negated classes one after the other, which
is ugly. So, here is a way to wrap that up into something a bit more elegant:

======
proc not pattern {
    # Not capturing bracket
    set ret {(?:}
    foreach char [split $pattern {}] {
        append ret \[^$char\]
    }
    append ret )
    return $ret
}
======

Then you can do:

======
regexp -- "AB([not AB]*)AB(.*)" ABcdefghABijklmnopqrst -> first rest
======

`$first` is `cdefgh` and `$rest` is `ijklmnopqrst`.

It handles things like:

======
regexp -- "AB([not AB]*)AB(.*)" ABcdefghBAijkABlmnopqrst -> first rest
======

`$first` is `cdefghBAijk` and `$rest` is `lmnopqrst`.

But that this only matches match patterns that are at least the same
length as the negated expression:

======
regexp -- "AB([not AB]*)AB(.*)" ABcABslkdjf -> first rest ;#-> 0
======

The proper solution to this problem is a lot more complex, unfortunately.

----

The above three expressions can be written using a lookahead constraint.

======
foreach str {ABcdefghABijklmnopqrst ABcdefghBAijkABlmnopqrst ABcABslkdjf} {
    set e "regexp -- {AB(?!AB)(.*)AB(.*)} $str -> first rest"
    puts "$e\n=> [eval $e]\nfirst = $first\nrest = $rest\n"
}
======

Output:

======none
regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghABijklmnopqrst -> first rest
#-> 1
#first -> cdefgh
#rest -> ijklmnopqrst

egexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghBAijkABlmnopqrst -> first rest
#-> 1
#first -> cdefghBAijk
#rest -> lmnopqrst

regexp -- {AB(?!AB)(.*)AB(.*)} ABcABslkdjf -> first rest
#-> 1
#first -> c
#rest -> slkdjf
======

----

Important to note on the `[[not pattern]]` example above is that it '''does not'''
match strings where there is an occurrence of the first letter from ''pattern''
when not part of the entirety of ''pattern'':

======none
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghAijklmABnopqrst -> first rest
0
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghBijklmABnopqrst -> first rest
1
% set first
cdefghBijklm
% set rest
nopqrst
======

----

[DKF]: It's actually fairly easy to request that an RE shouldn't match
something. You just need some magic around it like this:

======
regexp {^(?:(?!AB).)*$} $string
======

That matches any string that doesn't contain `AB` as a subsequence.




** Non-reporting sub-patterns **

I would love to see a some clarification on exactly how non-reporting
subpatterns work with `-inline`, specifically if you can silence the overall
pattern match:

======none
% set str {

asd;flkj <img src="example.jpg" >
sad;lfjl;kjf<IMg src="browser/ie.gif">
asdflaj;lkfjasdf
lsdk

}

% set _Img {<img src="?([\w\./]*)"?[^>]*>}
<img src="?([\w\./]*)"?[^>]*>
% regexp -all -nocase -inline $_Img $str
{<img src="example.jpg" >} example.jpg {<IMg src="browser/ie.gif">} browser/ie.gif
======

[glennj]: You can't silence the full match.  You will have to iterate over the
results of `[regexp]`:

======
set matches {} 
foreach {full submatch} [regexp -all -nocase -inline $_Img $str] {
    lappend matches $submatch
}
======



** Matching Optional Parts **

[elfring] 2004-07-05: Does anybody know problems and solutions to match
optional parts with regular expressions?

See
[http://groups.google.com/d/msg/comp.lang.tcl/IHFBMacaxAY/oZ40zmqpRg0J%|%Matching
of optional parts in regular expressions], [comp.lang.tcl], 2004-07-01.


[MG] 2004-07-17: The problem with the regexp there seems to be that one of
the parts to match optional white space is in the wrong place, and is matching
too much. If you use this regexp instead, it works for me, on Win XP with Tcl
8.4.6. (The change is that, after </S_URI> and before <P_URI>, `.*?` has been
moved inside `(?: ... )`

======
set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}

set string {<name>gruss</name>
<scope>SYSTEM</scope>
<S_URI>http://XXX/Hallo.dtd</S_URI>
<P_URI>http://YYY/Leute.dtd</P_URI>
<definition><!ELEMENT gruss (#PCDATA)></definition>
<attributes>Versuch="1"</attributes>
<content><h1>Guten Tag!</h1></content>}

regexp $pattern $str z name scope system public definition attributes content
======



** A Regular Expression to Match Many Things in Any Order **

[DKF]: Sometimes it is useful to be able to write a regular expression that
matches a string that contains some number of substrings (typically words) in
any order. In normal regexps, this is a horrible thing to write down as the
size of the RE term varies exponentially with the number of substrings.
However, if you don't mind matching behaviour that is guaranteed to be
non-optimal in some strict sense, and if you don't want ''any'' capturing
parens, you can use positive lookahead assertions to make things neater.

Thus, to match a string that contains '''foo''', '''bar''' and '''spong'''
within it in any order, use a RE like this:

======
set RE {(?=.*foo)(?=.*bar)(?=.*spong).}
set matched [regexp $RE $string]
======

Just note that if you use this, you ''cannot'' know where those strings
matched; lookahead assertions don't support that. If you need that data, use
multiple [regexp] matches instead



** Looping Over Matches **

[MAH]: What yould be the correct way to loop over all matches of a regular
expression in a string? I came up with the following solution

for finding all include statements in a string, but using `-start` has side
effects on the meaning of characters like `$` and `^`.

======
set pos 0
while {[regexp -start $pos {`include "([\w/.]+)"} $data string vincfile]==1} {
    set pos [expr {$pos+[string length $string]}]
    puts "file=$vincfile"
} 
======

[Lars H]: The option combination '''-all -inline''' is probably what you're
looking for (although in general the problem of "finding all matches" runs into
several technical issues, due to the fact that matches may overlap).

In combination with `-start`, one has to use `\A` and `\Z` instead of `$` and
`^`, unless the intent is to use the newline-sensitive behaviours of the
latter. -indices may also be useful.

[MAH]: Okay, `-inline` is too clumsy for me since I don't want the overall
match string. Instead I'll go with `-indices`. This gives me

======
set pos 0
while {[regexp -start $pos -indices {`include "([\w/.]+)"} $data -> vincfilepos]==1} {
    set vincfile [ string range $data [ lindex $vincfilepos 0 ]  [ lindex $vincfilepos 1 ] ]
    set pos [ lindex $vincfilepos 1 ]
    puts "file=$vincfile"
} 
======

I think that's rather clumsy for a task this common. Any ideas on how to make
it simpler?

[PYK] 2014-09-12:

======
set data {
`include "one.h"
`include "two.h"
}

foreach {all indices} [
    regexp -all -inline -indices {`include "([\w/.]+)"} $data] {

    lassign $indices first last
    set vincfile [string range $data $first $last]
    puts file=$vincfile
}
======

[HE] 2016-01-28: With the given regular expression I would use:

======
set data {
`include "one.h"
`include "two.h"
}

foreach {-> vincfile} [regexp -all -inline -- {`include "([\w/.]+)"} $data] {
        puts "file=$vincfile"
}
======



** Reporting Matches in Branches **

[PYK] 2016-09-11: Reportable matches are indexed according to the order in
which they appear lexically in the regular expression, and all reporting
matches in all branches are indexed.  Therefore, there's a trick to extracting
a piece of data where branches are used to express the conditions the data
might be found under.  Since only one branch is represented in the reported
matches, the reported matches from other branches are guaranteed to be the
[empty string], so the indexed matches can simply be concatenated.  For
example:

======
set url http://wiki.tcl.tk/461
set pattern {^(https?:\/\/wiki.tcl.tk\/)(?:.*\?redir=([0-9]+)|_\/edit\?N=([0-9]+)|([0-9]+))$}
regexp $pattern $url -> prefix b1 b2 b3
set result $b1$b2$b3 ;# -> 461
======



** History **

Tcl switched to Advanced Regular Expressions in version [Changes in Tcl/Tk
8.1%|%8.1].




** Misc **


The above discussion needs to cover the advanced regular expression syntax as
handled by Tcl 8.1 and later and show the user what all the differences are, so
that one can write portable code when necessary - or at least create
appropriate package require statements.

----

[KBK] has astutely remarked that, "Much of the art of designing recognizers is
the art of controlling such things in the common cases; regexp matching
in general is PSPACE-complete, and our extended regexps are even worse (... not
bounded by any tower of exponentials ...)."
[http://sourceforge.net/mailarchive/forum.php?thread_id=29946064&forum_id=6718]

[Lars H] 2008-06-01: Somehow I doubt [KBK] would say that, in part because it's
'''dead wrong''' as far as basic [regular expression]s are concerned — given a
regular expression of size ''m'' and a string of size ''n'' it is always
possible to test whether the string matches that regular expression in time
that is linear in ''n'' and polynomial in ''m''. Googling for "regexp matching
PSPACE complete" turns up this page, but otherwise rather suggests that ''other
problems'' concerning regular expressions, in particular deciding whether two
regular expressions are equivalent, may be PSPACE-complete. (Which is actually
kind of interesting, since the naive determinization algorithm for this might
need exponential amounts of memory and thus not be in PSPACE at all, but
off-topic.)

The link provided as source currently doesn't work (no surprise, it's into
[SourceForge] mail archives), but the forum_id seems to refer to development of
a [Perl] module (text::similarity, in some capitalization) rather than anything
Tcl related. That matching using Perl's so-called "regexps" should be "worse
than PSPACE-complete" is something I can believe, so in that context the quote
makes sense, but why it should then be attributed to [KBK], and moreover why it
should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is
still a mystery.

[LV]:  Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core
(well, actually from what I can see, one can only search ALL of activestate's
mailing list archives), doesn't turn up a reference like this. Maybe it is
quite old - before activestate?

[TV]: Auw, man... That´s like suggesting something like the [traveling salesman
problem] is only there to upset people that a certain repository will have the
perfect solution for this type of problem, but like that the actual worlds´
best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer
tapes from some university in 1984 or so, the whole of "datastructures and
algorithms" will end up like the ´English IT Show´ on the Comedy Channel, and
than on the [Who says Tcl sucks...] graveyard like the [connection machine] was
great but forgotten and the world´s greatest synthesizer developers/researchers
are in "The Dead Presidents Society" (CEO´s that is, like ´the dead Poets
Society´).



** Page Authors **

   [DKF]:   

   [LV]:   

   [MG]:   

   [PYK]:   

<<categories>> Tutorial | String Processing | Arts and crafts of Tcl-Tk programming