Version 19 of Regular Expression

Updated 2008-05-18 11:37:02 by dkf

The wiki page Regular Expressions will take you to a place where Tcl's ARE are discussed.

Technically, a regular expression is an ingenious notation (that was invented by Kleene in the 1950's) for specifying a regular language. (A "language" in this terminology is a set of strings, very often an infinite sets of strings. See http://en.wikipedia.org/wiki/Regular_language for formal definitions of regular language.)

Practically, a regular expression looks like a summary of the various forms that the set of strings it describes may take. One conventionally uses | as a separator between alternative branches of the expression. One may also write * (the Kleene star) after something to denote that it may be repeated zero or more times, and use parentheses to group things. A regular expression using all these things is

  A((b|cc)a)*

which denotes a language consisting of the strings

   A Aba Acca Ababa Abacca Accaba Accacca Abababa

and many others.

An important factor for the popularity of regular expressions is their linear-time complexity: when a given string is to be matched against a regular expression, it is possible to do it so that every character in the string is only looked at once. This is attained by compiling the regular expression into a finite automaton — potentially a big chunk of work, but one that only needs to be done once for each regular expression — and then running the automaton with the string as input.

Another important factor is that regular expressions can be used for efficiently searching through a large body of text. A direct implementation of the above would produce an algorithm for matching a string against a regular expression, but most RE implementations — among them regexp — play a few tricks internally that make them operate in search mode instead. In order to get matching behaviour (often useful with switch -regexp), one uses the anchors ^ and $ to require that the particular position in the regular expression must correspond to the beginning and end of the string respectively (caveat: sometimes it is beginning and end of line instead; AREs have \A and \Z as alternatives).

Other common extensions to the regular expression syntax, which however doesn't make them any more powerful than the basic set described above, are:

? quantifier
Something is optional. The RE “(RE)?” is however equivalent to “(RE|)”.

Okay then - feel free to add information here on the other RE flavors available in Tcl...

LES says that there no other RE flavors available in Tcl. Tcl only uses ARE. What I meant is that regular expressions may be construed as any one (or all) of its several variations, but Regular Expressions only discusses Tcl's ARE. I said that because this wiki discusses many things under several contexts, not necessarily that of Tcl, and I thought it would be good to note that, at least in this case, it is restricted to the context of Tcl. Anyone interested in a different or more ample discussion of Regular Expressions will have to look elsewhere. E.g. on PCRE.


RS: Well, the Wiki doesn't claim to have all info - it rather asks those in the know to contribute it :) Here's from man re_syntax:

"DIFFERENT FLAVORS OF REs

Regular expressions (“RE”s), as defined by POSIX, come in two flavors: extended REs (“ERE”s) and basic REs (“BRE”s). EREs are roughly those of the traditional egrep, while BREs are roughly those of the traditional ed. This implementation adds a third flavor, advanced REs (“ARE”s), basically EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX EREs are almost an exact subset of AREs. Features of AREs that are not present in EREs will be indicated."