Version 65 of Regular Expressions

Updated 2014-03-25 21:12:46 by AMG

Tcl Regular Expressions, implemented by Henry Spencer, are called Advanced Regular Expressions.

Disambiguation

"Regular Expressions" is the name of an at-least-monthly column on scripting languages CL has co-authored from 1998 to 2009. See Cameron Laird's Personal Notes on Regular Expressions .

See Also

regular expression
about regular expressions in general, including features not found in Tcl, and the theory behind them.
regexp
regsub
re_syntax
New Regular Expression Features in Tcl 8.1
Beginning Regular Expressions
Regular Expression Examples
Advanced Regular Expression Examples
Regular Expression Debugging Tips
Henry Spencer's "Tcl" Regex Library ,comp.compilers, 2007-10-01
regex - Henry Spencer's regular expression libraries
links to Henry Spencer's original release in 1994, "regex3.8a.tar.gz", that was included in 4.4 BSD Unix, Walter Waldo's port of Henry's Tcl regex in Tcl 8.1, and Thomas Lackener's port from Tcl-8.5a3
Chapter 11: Regular Expressions ,Book Practical Programming in Tcl and Tk , 3rd Edition, by Brent Welch
Tcl Scores High in RE Performance ,Cameron Laird and Kathryn Soraiz ,2007-02
Drawbacks of Tcl's Regexps
TCL regex implementation beats whole competition. Even C with PCRE and Boost regex, (shootout.alioth.debian.org) ,reddit.com ,2008-02-07
Unicode and Localisation Support
interesting in a Tcl context because it illustrates the Perl regular expressions, being derived from Spencer's earlier regular expression attempts, suffer Unicode issues that Tcl doesn't
Regular Expressions: Now You Have Two Problems ,Jeff Atwood ,2008-06-27
some pulpit-thumping about regular expressions

Resources

Komodo
includes a regular expression debugger
Visual REGEXP
Visual REGEXP
a regular expression toolkit
tkWorld
contains tkREM which is a regular expression maker. Perhaps someone familar with it would like to discuss it.
^txt2regex$
a Regular expression wizard written in bash2 that converts human sentences into regular expressions. It can be used to build up regular expressions suitable for use in Tcl.
redet
another tool to assist in developing regular expressions
nre30.tar.gz
one of a couple extensions prior to Tcl 8.1 that that provided a superset of regular expression functionality. This does not provide all the power of Tcl 8.1 and newer, but at least it is more than was available before 8.1.
tcLex
a lexical analyzer which uses Tcl regular expressions to do the matching
Yeti
another lexical analyser, parser generator.

Design

Tcl's regular expression engine is an interesting and subtle object for study in its own regard. While Perl is the language that deserves its close identification with RE capabilities, Tcl's engine competes well with it and every other one. In fact, although he doesn't favor Tcl as a language, RE expert Jeffrey Friedl has written that "Tcl's [RE] engine is a hybrid with the best of both worlds."

See Henry Spencer's reply in tcl 8.2 regexp not doing non-greedy matching correctly ,comp.lang.tcl ,1999-09-20.

Most common regular expression implementations (notable perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See Regular Expression Matching Can be Simple and Fast (but is slow in Java, Perl PHP, Python, Ruby,...) , Russ Cox ,2007-01, for some fascinating benchmarks.

Lars H: A very nice paper! Highly recommended for anyone interested in the internals of regular expression engines, and a good introduction to the theory.

Description

regexp and regsub accept argumetns that are Tcl regular expressions.

The following applies to Tcl 8.1 or newer.

Regular expression atoms and metacharacters fall into one of several classes. The first type express a specific character is to be matched.

literal
Any alphabetic, numeric, white space character are frequently treated as literal matches. However, there are a few cases, detailed below, where they are used in a metacharacter construct.
characters
means "match any of the characters in the brackets." If the first character inside the braces is caret (^), the meaning is changed to "match any character not in the brackets". A minus sign (-) between any two characters in the brackets means "all characters between these two."
.
matches any literal character
\k
When k is non-alphanumeric, the atom matches the literal character k
\c
When c is alphanumeric (possibly followed by other characters), the sequence is called a Regular Expression Escape Sequence

The second type are modifiers to regular expression atoms, expressing the quantity of characters to be matched. These are called quantified atoms.

*
The largest series (zero or more occurances) of the preceeding regular expression atom will be matched.
+
The largest series (one or more occurances) of the preceeding regular expression atom will be matched.
?
This is a boolean type quantifier - it means the atom may or may not appear (i.e. it may appear 0 or 1 times).
{m}
a sequence of exactly m matches of the atom
{m,}
a sequence of m or more matches of the atom
{m,n}
a sequence of no less than m and no more than n matches of the atom. LV In Tcl, this has an upper limit of 255.
*?
non-greedy form of * quantifier - if there is more than one match, selects the smallest of the matches
+?
non-greedy form of + quantifier
??
non-greedy form of ? quantifier
{m}?
non-greedy form of {m} quantifier

The third type of regular expression metacharacters modify a regular expression atom by placing constraints upon the characters being matched.

^
Matches the beginning of a string.
$
Matches the end of the string. While it is common to think of this character matching the newline, it actually just matches the end of the string, not some character at the end, and one cannot manipulate the newline by for instance trying to replace the symbol by a null string, etc.

The fourth type of regular expression metacharacters are used for expressing groupings.

(regular expression)`
Parenthesis surrounding one ore more regular expressions specify a nested regular expression or choice of several regular expressions. The substring matching the nested regular expression is captured and can be referred to via the back reference mechanism, and alo captured into the corresponding match variable specified as an arbument to the command.
(?:''regular expression)
Specifies that any match against the grouped regular expression will not be captured as a back reference or into a match variable.
()
This matches an empty string, returning the match in a variable.
(?:)
matches an empty string, but does not return the results in a match variable.
regular expression|regular expression
The vertical bar (pipe) character indicates a choice between the regulare expressions on either side. In other words, "match RE-a OR RE-b". Each side of the symbol is called a branch''. Each branch is an independent regular expression. Empty branches match an empty string.

Other Users of Tcl Regular Expressions

BAS : just a tidbit, the Postgresql DBMS uses Tcl's regexp engine for its own regexp handling; see

[L1 ].

Greedy vs Non-greedy

See Also

Henry Spencer's reply in tcl 8.2 regexp not doing non-greedy matching correctly ,comp.lang.tcl ,1999-09-20.
Tcl Regular Expressions – Greedy or Non-greedy? ,David Osborne ,2013-03-22
a case study in greediness, and how to select greedy or non-greedy for any expression.

Regular expression are either greedy or non-greedy, regardless of your mixture of greedy/non-greedy metacharacters. Refer to Regexp: Matching pairs of characters ,comp.lang.tcl ,2001-11-28 , and to tcl 8.2 regexp not doing non-greedy matching correctly ,comp.lang.tcl ,1999-09-20

MG: OK, I'm sure someone can do this better than me, but since nothing's here at the moment I'll make a start...

By default, the regexp characters + and * match as much as possible (which is called greedy matching). By placing a ? after them, you can make them match as little as possible (non-greedy). For example...

regexp a.+3 abc123abc123 var
set var

output:

abc123abc123

.+ matched all the characters up until the last 3. In contrast:

regexp a.+?3 abc123abc123 var
puts $var

output:

abc123

because +? matched as little as possible.

Greedy regexp matching is a particular problem in parsing HTML, etc, because...

set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
regexp <b>.*</b> $str var
puts $var

output:

Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold
Text

.* matched as much as possible. It began matching takes at the first occurance of <b> and didn't quit until the last occurance of </b>. But, using a non-greedy regexp to match...

set str {<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>}
regexp <b>.*?</b> $str var
puts $var

output:

Some Bold Text

Hope that explanation/rambling is some use, at least until someone with more idea what they're doing puts something up :)

AvL I'll now mention some common pitfall with non-greedy REs: Lets go back to the first example, but with a modified string:

regexp "a.+?3" "abc123ax3" var
set var

Although second possible match ax3 would be shorter, it will still find the first match abc123, because even with non-greedy quantifiers, the first match always wins.


One trick with non-greedy quantifies is to anchor the expression to the beginning/end of the string, which has the effect of stretching out the non-greedy match:

regexp -inline (.*?)(n+)(.*) ennui ;#-> en e n {} 
regexp -inline ^(.*?)(n+)(.*)$ ennui ;#-> ennui e nn ui

Expanded Syntax

Henry Spencer writes

 >...You can't put extra spaces into regular
 >expressions to improve readability, you just have to suffer along
 >with the rest of us.

Actually, since 8.1 you can, although since it's one of 57 new features, it's easy to miss. Like so:

set re {(?x)
    \s+ ([[:graph:]]+)      # first number
    \s+ ([[:graph:]]+)      # second number
}
set data "     -1.2117632E+00     -5.6254282E-01"
regexp $re $data match matchX matchY

The initial (?x), which must be right at the start, puts the regexp parser into expanded mode, which ignores white space (with some specific exceptions) and #-to-end-of-line comments.

Compiling Regular Expressions

The first time a variable passed as a regular expression of [regexp] or [regsub], it is compiled, and the compiled value is cached in the internal variable structure. To compile a regulare expression:

regexp $RE {}

Feature Request #312: Function to Compile a Pattern ,2003

DKF: I prefer to use

regexp -about $RE

to do the compilation, but that's probably a matter of style.

DKF: Technically, the compiled RE is cached in the internal representation of the RE value and not the variable. The effect is pretty much indistinguishable though (in all sane programs).

Comma Number Formatting

Some folks insist on inserting commas (or other characters) to format digits into groups of three. Here is a regexp to do the trick from Keith Vetter. (Thanks Keith!) The Perl manual describes a very slick method of doing this:

1 while s/^([-+]?\d+)(\d{3})/$1,$2/;

Translated into (pre 8.1) tcl you get:

set n 123456789.00
while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
puts $n

results in

123,456,789.00

You can tighten this up a little using Tcl 8.1's regular expressions:

while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}

Using the extended syntax, this becomes a bit easier to understand:

while {[regsub {(?x)
        ^([-+]?\d+)     # The number at the start of the string...
        (\d{3})         # ...has three digits at the end
} $n {\1,\2} n]} {
                                        # So we insert a comma there and repeat...
}

RS: For a version with configurable separator, see Bag of algorithms, item "Number commified"

Ro: See also Human readable file size formatting for a version without regular expressions for those of us who are allergic to monstrous complexity ;)

Exluding Patterns from a Match

NEM: A question on the Tclers Chat brought up a common problem that I've had when dealing with regular expressions. The RE engine allows [^AB] to mean "not A or B", but what if you want to match anything but the string AB? The only way to do it is to put lots of negated classes one after the other, which is ugly. So, here is a way to wrap that up into something a bit more elegant:

proc not {pattern} {
    set ret {(?:}     ;# Not capturing bracket
    foreach char [split $pattern {}] {
        append ret "\[^$char\]"
    }
    append ret ")"
    return $ret
}

Then you can do:

regexp -- "AB([not AB]*)AB(.*)" ABcdefghABijklmnopqrst -> first rest
first = "cdefgh"
rest = "ijklmnopqrst"

And it handles things like:

regexp -- "AB([not AB]*)AB(.*)" ABcdefghBAijkABlmnopqrst -> first ret
#first -> cdefghBAijk
#rest -> lmnopqrst

But that this will only match patterns which are at least the same length as the negated expression:

regexp -- "AB([not AB]*)AB(.*)" ABcABslkdjf -> first rest ;#-> 0

The proper solution to this problem is a lot more complex, unfortunately.


The above three regexp's can be written using a lookahead constraint.

foreach str {ABcdefghABijklmnopqrst ABcdefghBAijkABlmnopqrst ABcABslkdjf} {
    set e "regexp -- {AB(?!AB)(.*)AB(.*)} $str -> first rest"
    puts "$e\n=> [eval $e]\nfirst = $first\nrest = $rest\n"
}

Output:

regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghABijklmnopqrst -> first rest
#-> 1
#first -> cdefgh
#rest -> ijklmnopqrst

egexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghBAijkABlmnopqrst -> first rest
#-> 1
#first -> cdefghBAijk
#rest -> lmnopqrst

regexp -- {AB(?!AB)(.*)AB(.*)} ABcABslkdjf -> first rest
#-> 1
#first -> c
#rest -> slkdjf

Important to note on the "[not pattern]" example above is that it will NOT match strings where there is an occurrence of the first letter from {pattern} when not part of the entirety of {pattern}:

% regexp -- "AB([not AB]*)AB(.*)" ABcdefghAijklmABnopqrst -> first rest
0
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghBijklmABnopqrst -> first rest
1
% set first
cdefghBijklm
% set rest
nopqrst

DKF: It's actually fairly easy to request that an RE shouldn't match something. You just need some magic around it like this:

regexp {^(?:(?!AB).)*$} $string

That matches any string that doesn't contain AB as a subsequence.

Non-reporting sub-patterns

I would love to see a some clarification on exactly how non-reporting subpatterns work with -inline, specifically if you can silence the overall pattern match:

% set str {

asd;flkj <img src="example.jpg" >
sad;lfjl;kjf<IMg src="browser/ie.gif">
asdflaj;lkfjasdf
lsdk

}

% set _Img {<img src="?([\w\./]*)"?[^>]*>}
<img src="?([\w\./]*)"?[^>]*>
% regexp -all -nocase -inline $_Img $str
{<img src="example.jpg" >} example.jpg {<IMg src="browser/ie.gif">} browser/ie.gif

glennj: You can't silence the full match. You will have to iterate over the results of [regexp]:

set matches [list]
foreach {full submatch} [regexp -all -nocase -inline $_Img $str] {
    lappend matches $submatch
}

Matching Optional Parts

elfring 2004-07-05: Does anybody know problems and solutions to match optional parts with regular expressions?

See Matching of optional parts in regular expressions ,comp.lang.tcl ,2004-07-01.

MG July 17th 2004: The problem with the regexp there seems to be that one of the parts to match optional white space is in the wrong place, and is matching too much. If you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change is that, after </S_URI> and before <P_URI>, .*? has been moved inside (?: ... )

set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}

set string {<name>gruss</name>
<scope>SYSTEM</scope>
<S_URI>http://XXX/Hallo.dtd</S_URI>
<P_URI>http://YYY/Leute.dtd</P_URI>
<definition><!ELEMENT gruss (#PCDATA)></definition>
<attributes>Versuch="1"</attributes>
<content><h1>Guten Tag!</h1></content>}

regexp $pattern $str z name scope system public definition attributes content

A Regular Expression to Match Many Things in Any Order

DKF: Sometimes it is useful to be able to write a regular expression that matches a string that contains some number of substrings (typically words) in any order. In normal regexps, this is a horrible thing to write down as the size of the RE term varies exponentially with the number of substrings. However, if you don't mind matching behaviour that is guaranteed to be non-optimal in some strict sense, and if you don't want any capturing parens, you can use positive lookahead assertions to make things neater.

Thus, to match a string that contains foo, bar and spong within it in any order, use a RE like this:

set RE {(?=.*foo)(?=.*bar)(?=.*spong).}
set matched [regexp $RE $string]

Just note that if you use this, you cannot know where those strings matched; lookahead assertions don't support that. If you need that data, use multiple regexp matches instead

Looping over matches

MAH: What yould be the correct way to loop over all matches of a regular expression in a string? I came up with the following solution

for finding all include statements in a string, but using -start has side effects on the meaning of characters like $ and ^.

set pos 0
while {[regexp -start $pos {`include "([\w/.]+)"} $data string vincfile]==1} {
    set pos [expr {$pos+[string length $string]}]
    puts "file=$vincfile"
} 

Lars H: The option combination -all -inline is probably what you're looking for (although in general the problem of "finding all matches" runs into several technical issues, due to the fact that matches may overlap).

In combination with -start, one has to use \A and \Z instead of $ and ^, unless the intent is to use the newline-sensitive behaviours of the latter. -indices may also be useful.

MAH: Okay, -inline is too clumsy for me since I don't want the overall match string. Instead I'll go with -indices. This gives me

set pos 0
while {[regexp -start $pos -indices {`include "([\w/.]+)"} $data -> vincfilepos]==1} {
    set vincfile [ string range $data [ lindex $vincfilepos 0 ]  [ lindex $vincfilepos 1 ] ]
    set pos [ lindex $vincfilepos 1 ]
    puts "file=$vincfile"
} 

I think that's rather clumsy for a task this common. Any ideas on how to make it simpler?

Misc

The above discussion needs to cover the advanced regular expression syntax as handled by Tcl 8.1 and later and show the user what all the differences are, so that one can write portable code when necessary - or at least create appropriate package require statements.


KBK has astutely remarked that, "Much of the art of designing recognizers is the art of controlling such things in the common cases; regexp matching in general is PSPACE-complete, and our extended regexps are even worse (... not bounded by any tower of exponentials ...)." [L2 ]

Lars H 2008-06-01: Somehow I doubt KBK would say that, in part because it's dead wrong as far as basic regular expressions are concerned — given a regular expression of size m and a string of size n it is always possible to test whether the string matches that regular expression in time that is linear in n and polynomial in m. Googling for "regexp matching PSPACE complete" turns up this page, but otherwise rather suggests that other problems concerning regular expressions, in particular deciding whether two regular expressions are equivalent, may be PSPACE-complete. (Which is actually kind of interesting, since the naive determinization algorithm for this might need exponential amounts of memory and thus not be in PSPACE at all, but off-topic.)

The link provided as source currently doesn't work (no surprise, it's into SourceForge mail archives), but the forum_id seems to refer to development of a Perl module (text::similarity, in some capitalization) rather than anything Tcl related. That matching using Perl's so-called "regexps" should be "worse than PSPACE-complete" is something I can believe, so in that context the quote makes sense, but why it should then be attributed to KBK, and moreover why it should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is still a mystery.

LV: Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core (well, actually from what I can see, one can only search ALL of activestate's mailing list archives), doesn't turn up a reference like this. Maybe it is quite old - before activestate?

TV: Auw, man... That´s like suggesting something like the traveling salesman problem is only there to upset people that a certain repository will have the perfect solution for this type of problem, but like that the actual worlds´ best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer tapes from some university in 1984 or so, the whole of "datastructures and algorithms" will end up like the ´English IT Show´ on the Comedy Channel, and than on the Who says Tcl sucks... graveyard like the connection machine was great but forgotten and the world´s greatest synthesizer developers/researchers are in "The Dead Presidents Society" (CEO´s that is, like ´the dead Poets Society´).