Purpose: To document the special [Regular Expressions] escape sequences available in Tcl 8.1 or newer.

----


Escapes (available only in advanced regular expressions (ARE)), which begin with a \ followed by an alphanumeric character, come in several varieties:  character entry, class shorthands, constraint escapes, and back references. A \ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs.  In extended regular expressions (EREs), there are no escapes: outside a bracket expression, a \ followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression, \ is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
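
The difference between the two flavors can be observed directly from a Tcl shell. The following is only a sketch: it assumes the (?e) embedded option described in the re_syntax manual page (not covered on this page) to switch the rest of a pattern to ERE syntax, and the variable names are purely illustrative.

```tcl
# ARE (the default flavor for Tcl's regexp): \q is not a defined escape,
# so the pattern is rejected when it is compiled.
set are_failed [catch {regexp {\q} "q"} are_msg]

# ERE, selected here with the (?e) embedded option (an assumption taken
# from the re_syntax manual page): \q simply stands for the letter q.
set ere_matched [regexp {(?e)\q} "q"]

# Inside a bracket expression an ERE treats \ as an ordinary character,
# so the bracket expression below matches a backslash. The ARE reads the
# same bracket expression as "any digit".
set ere_backslash [regexp {(?e)[\d]} "\\"]
set are_backslash [regexp {[\d]} "\\"]
set are_digit     [regexp {[\d]} "7"]

puts "ARE rejects \\q:                 $are_failed"
puts "ERE matches q with \\q:          $ere_matched"
puts "ERE bracket matches backslash:  $ere_backslash"
puts "ARE bracket matches backslash:  $are_backslash"
puts "ARE bracket matches a digit:    $are_digit"
```

The last three results illustrate the one incompatibility mentioned above: the same bracket expression means different things in the two flavors.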

Character-entry escapes (AREs only) exist to make it easier to specify non-printing and otherwise inconvenient characters in REs:

   \a:   alert (bell) character, as in C

   \b:   backspace, as in C

   \B:   synonym for \ to help reduce backslash doubling in some applications where there are multiple levels of backslash processing

   \cX:   (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero

   \e:   the character whose collating-sequence name is `ESC', or failing that, the character with octal value 033

   \f:   formfeed, as in C

   \n:   newline, as in C

   \r:   carriage return, as in C

   \t:   horizontal tab, as in C

   \uwxyz:   (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering

   \Ustuvwxyz:   (where stuvwxyz is exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode extension to 32 bits

   \v:   vertical tab, as in C

   \xhhh:   (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used).

   \0:   the character whose value is 0

   \xy:   (where xy is exactly two octal digits, and is not a back reference (see below)) the character whose octal value is 0xy

   \xyz:   (where xyz is exactly three octal digits, and is not a back reference (see below)) the character whose octal value is 0xyz

Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'.  Octal digits are `0'-`7'.

The character-entry escapes are always taken as ordinary characters.  For example, \135 is ]] in ASCII, but \135 does not terminate a bracket expression.  Beware, however, that some applications (e.g., C compilers) interpret such sequences themselves before the regular-expression package gets to see them, which may require doubling (quadrupling, etc.) the `\'.
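
A few of the character-entry escapes can be tried out as follows. This is a minimal sketch with illustrative variable names. It also shows the doubling issue in Tcl itself: a pattern in braces reaches the regular-expression package untouched, while a pattern in double quotes first goes through Tcl's own backslash substitution.

```tcl
# Brace-quoted patterns reach the RE engine untouched, escapes included.
set tab_ok [regexp {^a\tb$} "a\tb"]     ;# \t is a horizontal tab
set esc_ok [regexp {^\e$} "\033"]       ;# \e is ESC (octal 033)
set hex_ok [regexp {^\x41$} "A"]        ;# \x41 is the letter A
set uni_ok [regexp {^\u0041$} "A"]      ;# \u0041 is the same character
set ctl_ok [regexp {^\cA$} "\001"]      ;# \cA keeps the low 5 bits of A: value 1

# \135 is the closing-bracket character, yet inside a bracket expression
# it is an ordinary member and does not terminate the expression.
set bracket_ok [regexp {^[\135x]+$} "]x]"]

# Double quotes add a level of backslash processing: Tcl reduces "\d" to
# a plain "d" before regexp sees it, so the backslash has to be doubled.
set lost    [regexp "\d" "5"]
set doubled [regexp "\\d" "5"]

puts "tab=$tab_ok esc=$esc_ok hex=$hex_ok unicode=$uni_ok control=$ctl_ok"
puts "bracket=$bracket_ok lost=$lost doubled=$doubled"
```

Writing patterns in braces is usually the simplest way to avoid the extra level of backslash processing.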

Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used character classes:

   \d:   [[[[:digit:]]]]

   \s:   [[[[:space:]]]]

   \w:   [[[[:alnum:]]_]] (note underscore)

   \D:   [[^[[:digit:]]]]

   \S:   [[^[[:space:]]]]

   \W:   [[^[[:alnum:]]_]] (note underscore)

Within bracket expressions, `\d', `\s', and `\w' lose their outer brackets, and `\D', `\S', and `\W' are illegal.  (So, for example, [[a-c\d]] is equivalent to [[a-c[[:digit:]]]].  Also, [[a-c\D]], which is equivalent to [[a-c^[[:digit:]]]], is illegal.)
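
The class shorthands might be exercised like this. It is a sketch only; the sample strings and variable names are made up for illustration.

```tcl
# \d+ collects each run of digits.
set numbers [regexp -all -inline {\d+} "a12b345"]

# \w covers letters, digits and the underscore.
set ident_ok [regexp {^\w+$} "foo_bar1"]

# \s+ matches any run of white space; here each run becomes one space.
set squeezed [regsub -all {\s+} "a  b\t c" " "]

# Inside a bracket expression \d behaves like the POSIX class it abbreviates.
set short_run [regexp -inline {[a-c\d]+} "xxb2a9zz"]
set posix_run [regexp -inline {[a-c[:digit:]]+} "xxb2a9zz"]

# The negated shorthands are rejected inside a bracket expression.
set negated_failed [catch {regexp {[a-c\D]} "x"}]

puts "numbers=$numbers ident_ok=$ident_ok squeezed=$squeezed"
puts "short_run=$short_run posix_run=$posix_run negated_failed=$negated_failed"
```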

A constraint escape (AREs only) is a constraint, matching the empty string if specific conditions are met, written as an escape:

   \A:   matches only at the beginning of the string (see the MATCHING section of the re_syntax manual page for how this differs from `^')

   \m:   matches only at the beginning of a word

   \M:   matches only at the end of a word

   \y:   matches only at the beginning or end of a word

   \Y:   matches only at a point that is not the beginning or end of a word

   \Z:   matches only at the end of the string (see the MATCHING section of the re_syntax manual page for how this differs from `$')

   \m:   (where m is a nonzero digit) a back reference, see below

   \mnn:   (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference, see below

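A word is a sequence of word characters (alphanumerics and underscore) that is neither preceded nor followed by a word character, as in the specification of [[[[:<:]]]] and [[[[:>:]]]] in the re_syntax manual page.  Constraint escapes are illegal within bracket expressions.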
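
The constraint escapes can be illustrated with a short sketch. The sample text and variable names are invented for the example, and the second half assumes the -line switch of the regexp command, which turns on newline-sensitive matching.

```tcl
set text "cat concat catalog"

# \m anchors at the start of a word: "cat" and "catalog" qualify.
set starts [regexp -all {\mcat} $text]

# \M anchors at the end of a word: "cat" and "concat" qualify.
set ends [regexp -all {cat\M} $text]

# \y on both sides leaves only the stand-alone word.
set whole [regexp -all {\ycat\y} $text]

# \Y requires a point strictly inside a word on both sides of "at".
set inner [regexp {\Yat\Y} "catalog"]

# With -line, ^ and $ also match next to a newline,
# while \A and \Z stay tied to the ends of the whole string.
set lines "first\nsecond"
set caret  [regexp -line {^second} $lines]
set bigA   [regexp -line {\Asecond} $lines]
set dollar [regexp -line {first$} $lines]
set bigZ   [regexp -line {first\Z} $lines]

puts "starts=$starts ends=$ends whole=$whole inner=$inner"
puts "caret=$caret A=$bigA dollar=$dollar Z=$bigZ"
```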

A back reference (AREs only) matches the same string matched by the parenthesized subexpression specified by the number, so that (e.g.)  ([[bc]])\1 matches bb or cc but not `bc'.  The subexpression must entirely precede the back reference in the RE.  Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions.
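
A small sketch of back references in use follows. The doubled-word pattern and the variable names are illustrative only and do not come from the manual page.

```tcl
# The example from the text: the same character twice.
set bb [regexp {([bc])\1} "bb"]
set cc [regexp {([bc])\1} "cc"]
set bc [regexp {([bc])\1} "bc"]

# Finding a doubled word: \1 must repeat exactly what (\w+) captured.
set doubled [regexp -inline {\m(\w+)\s+\1\M} "this is is a test"]

# Non-capturing parentheses are not numbered, so \1 refers to (b) here.
set counts_b [regexp {(?:a)(b)\1} "abb"]
set counts_a [regexp {(?:a)(b)\1} "aba"]

puts "bb=$bb cc=$cc bc=$bc"
puts "whole match: [lindex $doubled 0], captured word: [lindex $doubled 1]"
puts "abb=$counts_b aba=$counts_a"
```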

There is an inherent historical ambiguity between octal character-entry escapes and back references, which is resolved by heuristics, as hinted at above.  A leading zero always indicates an octal escape.  A single non-zero digit, not followed by another digit, is always taken as a back reference.  A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e. the number is in the legal range for a back reference), and otherwise is taken as octal.
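
The heuristics can be checked case by case. The following is a sketch under the rules just stated, with hypothetical variable names; the ten-group pattern is built with string repeat only to keep the example short.

```tcl
# A leading zero always means octal: 012 is the newline character.
set zero_is_octal [regexp {^\012$} "\n"]

# A single nonzero digit is always a back reference, so with no
# subexpression to refer to the pattern fails to compile.
set lone_digit_failed [catch {regexp {\1} "x"}]

# Several digits with no suitable subexpression: taken as octal.
# Octal 101 is the letter A, octal 10 is the backspace character.
set multi_is_octal [regexp {^\101$} "A"]
set ten_is_octal   [regexp {^\10$} "\b"]

# After ten capturing groups, \10 is in range and becomes a back
# reference to the tenth group, which captured a single "a".
set ten_groups [string repeat {(a)} 10]
set ten_is_backref [regexp "^${ten_groups}\\10\$" [string repeat a 11]]

puts "zero_is_octal=$zero_is_octal lone_digit_failed=$lone_digit_failed"
puts "multi_is_octal=$multi_is_octal ten_is_octal=$ten_is_octal"
puts "ten_is_backref=$ten_is_backref"
```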

----
[Category Documentation]