The '''regular expression''' notation was invented by Kleene in the 1950's for specifying a regular language. This page is about regular expressions in general, not necessarily Tcl [regular Expressions] in Tcl. ** Resources ** [http://regexplorer.sourceforge.net/%|%RegExplorer]: ** Description ** A '''regular expression''' uses a particular notation to describe a ''pattern'' that a string of characters might match. Some algorithms are easier coded without regular expressions. Tcl's [string] command is versatile, and often simplifies problems many programmers hit with the RE hammer. To begin with, here are some definitions: '''atom''': a grouping of 0 or more characters which are to be treated as one entity by subsequent metacharacters, etc. If more than one character is to be included in an atom, then some sort of grouping syntax is needed. This is by analogy with the physical meaning of ''atom'': an indivisible building block that underlies all other structures. Each atom is an individual regular expression, and can also be composed into larger expressions. '''bracket expression''': also called a character class, is a set of characters enlosed in brackets. It matches any of the characters in the brackets. For example, `[[a-z]` matches can be any lower case alphabetic character from `a` to `z`. `[[^a-z]` matches any ASCII character ''except'' a lower case alphabetic character. The bracket expression can contain various ranges of characters. '''branch''': `ABC|DEF` matches either `ABC` or `DEF`. '''escape''': an escape is one of a small set of characters preceeded by \ '''metasyntax''': a sequence of `(?`metachars`)` where ''metachars'' is one or more alphabetic characters affecting the rest of the regular expressions. ''[[metasyntax is one place where a variety of examples would be useful!]]'' '''quantifier''': A metacharacter appended to a regular expression atom to indicate the number of times that atom may exist. A regular expression atom is made of either a '''literal character''' or a '''metacharacter'''. A literal character is the simplest regular expression possible. For example, `a` is a one-character regular expression. It can be used to match a portion of any string which contains the letter `a`. Search for a pattern matching the regular expression `a` in the string `abc`, and you get a match. Search for a pattern matching the regular expression `a` in the string `xyz` and you do not get a match. A metacharacter, also known as an '''interpreted character''', is intepreted as some non-literal pattern. For instance, the metacharacter `.` means 'match any character'. Search for the regular expression `.` in any string one character or longer and you get a match. Another metacharacter is the `\` (backslash). which indicates that the next character should be treated literally, even if it would normally be a metacharacter. This comes in most helpfully when attempting to describe patterns containing metacharacters. A regular expression looks like a summary of the various forms that the set of strings it describes may take. `|` usually delimits alternative branches of the expression. `*` (the ''Kleene star'') indicates zero or more occurances of the previous character, and use parentheses to group sub-expressions. All of these contsructs appear in the following example: ======none A((b|cc)a)* ====== The following strings are from the infinite set of matches for this expression: ======none A Aba Acca Ababa Abacca Accaba Accacca Abababa ====== An important factor for the popularity of regular expressions is their linear-time complexity: when a given string is to be matched against a regular expression, it is possible to do it so that every character in the string is only looked at once. This is attained by compiling the regular expression into a [finite automaton] — potentially a big chunk of work, but one that only needs to be done once for each regular expression — and then running the automaton with the string as input. Another important factor is that regular expressions can be used for efficiently ''searching'' through a large body of text. A direct implementation of the above would produce an algorithm for ''matching'' a string against a regular expression, but most RE implementations, including [[`[regexp]`], play a few tricks internally that make them operate in search mode instead. In order to get matching behaviour (often useful with [switch] -regexp), one uses the ''anchors'' `^` and `$` to require that the particular position in the regular expression must correspond to the beginning and end of the string respectively (caveat: sometimes it is beginning and end of line instead; [ARE]s have `\A` and `\Z` as alternatives). Other common extensions to the regular expression syntax, which however doesn't make them any more powerful than the basic set described above, are: One-or-more quantifier (`+`): Similar to `*`, but excluding the case of no repetition. The RE “`(`''re''`)+`” is equivalent to “`(`''re''`)(`''re''`)*`” and “`(`''re''`)*(`''re''`)`”. Optional quantifier (`?`): Also called the zero-or-one quantifier. The RE “`(`''re''`)?`” is equivalent to “`(`''re''`|)`”. Bracket expression (`[[`''chars''`]`): Effectively a shorthand — e.g. `[[abcd]` is equivalent to `(a|b|c|d)` — but often far more compact. Unicode character classes often include thousands of character, so enumerating all of them would be unfeasible. Counted quantifiers (`{`''n''`}` or `{`''m''`,`''n''`}`): Exactly ''n'' occurrencies, or at least ''m'' and at most ''n'' occurrencies, respectively. Can as `?` be reduced to combinations of parentheses and `|`, but requires repeating the core RE the quantifier is applied to at least ''n'' times. AND: Boolean "and", like `|` is boolean "or". Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the intersection of two regular languages is again a regular language. The [grammar_fa] package uses `&` to denote this. NOT: Boolean negation. Uncommon in regexp engines, and not easily reduced to the fundamental operations, but nonetheless the complement of a regular language is also a regular language. The [grammar_fa] package uses `!` to denote this. Reversal: Change the direction in which the string is being matched; this may be useful to implement searching backwards in a text editor. There is no common syntax for this, but by hand transforming a regular expression accordingly is typically straightforward. A somewhat intriguing class of such extensions are the ''constraints'', which only match the empty string but refuse to do so unless the material surrounding the position of this match satisfies some condition. Here we find: Positive lookahead (`(?=`''re''`)`): The regular expression ''re'' must match the text that follows after this position. This is similar to AND, but different in that ''re1'' and ''re2'' in `((?=`''re1''`)`''re2''`)` are not required to match the same characters; ''re1'' can match a prefix of what ''re2'' matches, or vice versa. Negative lookhead (`(?!`''re''`)`): The regular expression ''re'' must ''not'' match the text that follows after this position. Positive lookbehind (`(?<=`''re''`)`): The regular expression ''re'' must match the text that comes before this position. This is not available in [ARE]s. Negative lookbehind (`(?> Glossary | Concept