The wiki page [Regular Expressions] will take you to a place where Tcl's [ARE] are discussed. Technically, a '''regular expression''' is an ingenious notation (that was invented by Kleene in the 1950's) for specifying a regular language. (A "language" in this terminology is a set of strings, very often an infinite sets of strings. See http://en.wikipedia.org/wiki/Regular_language for formal definitions of ''regular'' language.) Practically, a regular expression looks like a summary of the various forms that the set of strings it describes may take. One conventionally uses `|` as a separator between alternative branches of the expression. One may also write * (the ''Kleene star'') after something to denote that it may be repeated zero or more times, and use parentheses to group things. A regular expression using all these things is A((b|cc)a)* which denotes a language consisting of the strings A Aba Acca Ababa Abacca Accaba Accacca Abababa and many others. An important factor for the popularity of regular expressions is their linear-time complexity: when a given string is to be matched against a regular expression, it is possible to do it so that every character in the string is only looked at once. This is attained by compiling the regular expression into a [finite automaton] — potentially a big chunk of work, but one that only needs to be done once for each regular expression — and then running the automaton with the string as input. Another important factor is that regular expressions can be used for efficiently ''searching'' through a large body of text. A direct implementation of the above would produce an algorithm for ''matching'' a string against a regular expression, but most RE implementations — among them [regexp] — play a few tricks internally that make them operate in search mode instead. In order to get matching behaviour (often useful with [switch] -regexp), one uses the ''anchors'' `^` and `$` to require that the particular position in the regular expression must correspond to the beginning and end of the string respectively (caveat: sometimes it is beginning and end of line instead; [ARE]s have `\A` and `\Z` as alternatives). Other common extensions to the regular expression syntax, which however doesn't make them any more powerful than the basic set described above, are: One-or-more quantifier (`+`): Similar to `*`, but excluding the case of no repetition. The RE “`(`''re''`)+`” is equivalent to “`(`''re''`)(`''re''`)*`” and “`(`''re''`)*(`''re''`)`”. Optional quantifier (`?`): Also called the zero-or-one quantifier. The RE “`(`''re''`)?`” is equivalent to “`(`''re''`|)`”. Bracket expression (`[[`''chars''`]]`): Effectively a shorthand — e.g. `[[abcd]]` is equivalent to `(a|b|c|d)` — but often far more compact. Unicode character classes often include thousands of character, so enumerating all of them would be unfeasible. Counted quantifiers (`{`''n''`}` or `{`''m''`,`''n''`}`): Exactly ''n'' occurrencies, or at least ''m'' and at most ''n'' occurrencies, respectively. Can as `?` be reduced to combinations of parentheses and `|`, but requires repeating the core RE the quantifier is applied to at least ''n'' times. AND: Boolean "and", like `|` is boolean "or". Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the intersection of two regular languages is again a regular language. The [grammar_fa] package uses `&` to denote this. NOT: Boolean negation. Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the complement of a regular language is also a regular language. The [grammar_fa] package uses `!` to denote this. Reversal: Change the direction in which the string is being matched; this may be useful to implement searching backwards in a text editor. There is no common syntax for this, but by hand transforming a regular expression accordingly is typically straightforward. A somewhat intriguing class of such extensions are the ''constraints'', which only match the empty string but refuse to do so unless the material surrounding the position of this match satisfies some condition. The general constraints are: Positive lookahead (`(?=`''re''`)`: The regular expression ''re'' must match the text that follows after this position. This is similar to AND, but different in that ''re1'' and ''re2'' in `((?=`''re1''`)`''re2''`)` are not required to match the same characters; ''re1'' can match a prefix of what ''re2'' matches, or vice versa. Negative lookhead (`(?!`''re''`)`: The regular expression ''re'' must ''not'' match the text that follows after this position. Positive lookbehind (`(?<=`''re''`)`: The regular expression ''re'' must match the text that comes before this position. This is not available in [ARE]s. Negative lookbehind (`(?:]]]]`): These are obvious combinations of lookhead and lookbehind constraints, where one checks e.g. that the next character is a word character and the previous character was not. Beginning or end of word (`\y`), not beginning or end of word (`\Y`): Ditto. Beginning of line (`^`), end of line (`$`): Similar to beginning and end of word. (In order to implement these into a finite automaton, one roughly has to form the product of an automaton for the main RE and an automaton for the constraint RE — the idea is that on must keep track of both the state visavi the main RE and the state visavi the constraint RE — which are then slightly modified so that the transition from one side to the other of the constraint takes one to a constraint-initial state. Might there be a problem if two constraints overlap, though?) The Tcl regexp engine handles lookaheads by compiling and running the constraint RE separately; in this way it is a "hybrid" RE engine. Another set of usual extensions to the syntax concern submatch extraction and greediness. These mainly become meaningful when regular expressions are used for ''searching'', as there in that case often are several substrings that match a particular RE, and it matters what match is reported. Non-greedy quantifiers: Conventionally one lets quantifiers default to being greedy (match as much as possible), and introduce non-greedy quantifiers (match as little as possible) as variants; usual syntax is that a greedy quantifier followed by a `?` becomes the corresponding non-greedy quantifier. Implementation-wise, the algorithm for non-greedy searching is easier than that for greedy searching. Noncapturing parentheses: Usual notation is `(?:`''re''`)`. The same as ordinary parentheses for searching and matching purposes, but tells the RE engine that it doesn't have to keep track of what range of a match corresponds to this parenthesis. Again, it is more work to keep track of this than to ignore it, but tradition dictates that parentheses should default to being capturing. Many "regular expression" engines also support extensions to the syntax which allow them to go ''beyond'' the realm of regular languages. This is most common in backtracking RE engines (such as [PCRE] and the [Perl] engine) which ignores the finite automaton theory and instead uses trial and error to find a match, but some have escaped into more general standards. Back references (`\`''n''): Matches the exact substring matched by the ''n''th capturing parenthesis in the RE. This is labelled '''the Feature from the Black Lagoon''' in the sources for the Tcl regexp engine. The (only?) non-regular feature found in the regular expressions of the [POSIX] standard. Practical applications include parsing data where one can introduce custom separator strings. Recursive match (`\R`) and subroutines: Behave as if a copy of the whole regular expression, or of some "named subexpression", occurred at this point when trying to match. Together with `|`, this is all one needs for a general context-free grammar [http://en.wikipedia.org/wiki/Context-free_grammar] (although the syntax is typically hideous compared to the standard BNF notation for these), which conversely means any engine capable of these features have to be ''at least as slow'' (roughly time cubic in input size) as a context-free language parser when handling these, and quite likely far slower for the worst cases. The Tcl regexp engine does not have these features. ---- Okay then - feel free to add information here on the other RE flavors available in Tcl... [LES] says that there no other RE flavors available in Tcl. Tcl only uses [ARE]. What I meant is that regular expressions may be construed as any one (or all) of its several variations, but [Regular Expressions] only discusses Tcl's [ARE]. I said that because this wiki discusses many things under several contexts, not necessarily that of Tcl, and I thought it would be good to note that, at least in this case, '''it is''' restricted to the context of Tcl. Anyone interested in a different or more ample discussion of Regular Expressions will have to look elsewhere. E.g. on [PCRE]. ---- [RS]: Well, the Wiki doesn't claim to have all info - it rather asks those in the know to contribute it :) Here's from man re_syntax: : DIFFERENT FLAVORS OF REs ====== ====== : Regular expressions (“RE”s), as defined by POSIX, come in two flavors: extended REs (“ERE”s) and basic REs (“BRE”s). EREs are roughly those of the traditional egrep, while BREs are roughly those of the traditional ed. This implementation adds a third flavor, advanced REs (“ARE”s), basically EREs with some significant extensions. ====== ====== : This manual page primarily describes AREs. BREs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX EREs are almost an exact subset of AREs. Features of AREs that are not present in EREs will be indicated. ---- !!!!!! %| [Category Glossary] |% !!!!!!