Purpose: To document the special regular expression escape sequences available in Tcl 8.x or newer.
Escapes, which are available only in advanced regular expressions (AREs) and begin with a \ followed by an alphanumeric character, come in several varieties: character entry, class shorthands, constraint escapes, and back references. A \ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In extended regular expressions (EREs), there are no escapes: outside a bracket expression, a \ followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression, \ is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
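The ARE/ERE difference described above can be sketched in a few lines of Tcl. This is an illustrative sketch, not part of the original text: it assumes that Tcl's regexp command compiles patterns as AREs by default and that the embedded option (?e) switches the rest of the pattern to ERE syntax. The variable names are arbitrary.

```tcl
# In an ARE, \q is not a valid escape, so the pattern fails to compile.
set areFailed [catch {regexp {\q} "q"} areMessage]

# Assumption: the (?e) prefix makes the rest of the pattern an ERE.
# Outside a bracket expression, \q is then just the ordinary character q.
set ereOutside [regexp {(?e)\q} "q"]

# In an ERE, \ inside a bracket expression is an ordinary character,
# so [\d] matches a backslash (or the letter d) but not a digit.
set ereBackslash [regexp {(?e)[\d]} "\\"]
set ereDigit     [regexp {(?e)[\d]} "5"]

puts [list $areFailed $ereOutside $ereBackslash $ereDigit]
```

The patterns are written in braces so that Tcl passes the backslashes through to the regular-expression engine unchanged.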
Character-entry escapes (AREs only) exist to make it easier to specify non-printing and otherwise inconvenient characters in REs:
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII, but \135 does not terminate a bracket expression. Beware, however, that some applications (e.g., C compilers) interpret such sequences themselves before the regular-expression package gets to see them, which may require doubling (quadrupling, etc.) the \.
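A short Tcl sketch of the \135 example follows. The \u form (exactly four hexadecimal digits) is taken from the Tcl re_syntax documentation rather than from the text above, so treat it as an assumption here. The last line shows the doubling issue as it arises in Tcl itself, where a double-quoted word is processed by Tcl before the regular-expression engine sees it.

```tcl
# \135 is the octal escape for ] (ASCII 93).
set plain [regexp {\135} "]"]

# Inside a bracket expression, \135 is still an ordinary ] and does not
# close the bracket; the final literal ] does.
set inBracket [regexp {^[a\135]+$} "a]a"]

# Assumption (from re_syntax): \u followed by four hexadecimal digits
# names the same character.
set unicode [regexp {\u005D} "]"]

# In a double-quoted Tcl word, Tcl processes backslashes first, so the
# backslash is doubled for the regular-expression engine to see \135.
set doubled [regexp "\\135" "]"]

puts [list $plain $inBracket $unicode $doubled]
```

Writing patterns in braces, as in the first three lines, usually avoids the need for doubling in Tcl scripts.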
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used character classes:
Within bracket expressions, \d, \s, and \w lose their outer brackets, and \D, \S, and \W are illegal. (So, for example, [a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equivalent to [a-c^[:digit:]], is illegal.)
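The bracket-expression rule can be checked with a minimal Tcl sketch such as the following. The test strings and variable names are made up for illustration.

```tcl
# [a-c\d] and [a-c[:digit:]] accept the same characters.
set shorthand [regexp {^[a-c\d]+$} "a1b2c3"]
set posix     [regexp {^[a-c[:digit:]]+$} "a1b2c3"]

# A character outside a-c and the digits is rejected.
set rejected  [regexp {^[a-c\d]+$} "a1x"]

# \D is illegal inside a bracket expression, so compilation fails
# and catch returns a non-zero code.
set illegal [catch {regexp {[a-c\D]} "x"} message]

puts [list $shorthand $posix $rejected $illegal]
```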
A constraint escape (AREs only) is a constraint, matching the empty string if specific conditions are met, written as an escape:
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are illegal within bracket expressions.
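As a hedged illustration of constraint escapes, the sketch below uses \m (beginning of a word), \M (end of a word), and \y (beginning or end of a word). These three names come from the Tcl re_syntax documentation and are not listed in the text above, so they are assumptions in this context.

```tcl
# \m and \M match the empty string at the start and end of a word,
# so the pattern only matches "cat" as a whole word.
set wholeWord [regexp {\mcat\M} "the cat sat"]
set embedded  [regexp {\mcat\M} "concatenate"]

# A constraint on its own matches the empty string.
set emptyMatch [lindex [regexp -inline {\m} "cat"] 0]

# Constraint escapes are illegal within bracket expressions.
set illegal [catch {regexp {[\y]} "x"} message]

puts [list $wholeWord $embedded [string length $emptyMatch] $illegal]
```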
A back reference (AREs only) matches the same string matched by the parenthesized subexpression specified by the number, so that (e.g.) ([bc])\1 matches bb or cc but not bc. The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions.
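The back-reference rules above can be tried out with a small Tcl sketch like this one. The (?:...) syntax for non-capturing parentheses is standard ARE syntax but is not shown in the text above; the test strings are invented for illustration.

```tcl
# ([bc])\1 requires the same character twice.
set bb [regexp {([bc])\1} "bb"]
set cc [regexp {([bc])\1} "cc"]
set bc [regexp {([bc])\1} "bc"]

# Non-capturing parentheses are not numbered, so \1 refers to (b) here.
set nonCapturing [regexp {^(?:a)(b)\1$} "abb"]

# The subexpression must precede the back reference; otherwise the
# pattern fails to compile.
set tooEarly [catch {regexp {\1(a)} "aa"} message]

puts [list $bb $cc $bc $nonCapturing $tooEarly]
```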
There is an inherent historical ambiguity between octal character-entry escapes and back references, which is resolved by heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e. the number is in the legal range for a back reference), and otherwise is taken as octal.
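The three cases of the heuristic can be sketched in Tcl as follows. The twelve-group pattern is a contrived construction used only to put \12 inside the legal back-reference range; it is not taken from the original text.

```tcl
# A leading zero always means octal: \012 is a newline (octal 12).
set leadingZero [regexp {a\012b} "a\nb"]

# A single non-zero digit is always a back reference.
set singleDigit [regexp {(a)\1} "aa"]

# With only one subexpression, \12 is out of range for a back
# reference, so it is read as octal 12, i.e. a newline.
set octalFallback [regexp {(a)\12} "a\n"]

# With twelve subexpressions before it, \12 is a back reference to
# the twelfth one.
set twelve [string repeat {(.)} 12]
append twelve {\12}
set backref12  [regexp $twelve "abcdefghijkll"]
set notNewline [regexp $twelve "abcdefghijkl\n"]

puts [list $leadingZero $singleDigit $octalFallback $backref12 $notNewline]
```

Writing octal escapes with a leading zero avoids depending on how many subexpressions happen to precede them.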