Tcl Regular Expression Match Requirements describes a set of regular expressions, and what, ideally, they would match.
Tcl's regular expression engine has a particular design that leads to some unexpected results. The regular expression examples collected on this page, together with their ideal results, are intended as a guide for the further development of regular expression routines in Tcl.
On 2015-09-21, a significant patch to regcomp.c was contributed by the PostgreSQL project. See Regexp backreference fail with a * closure and Fix for quantified regexp back-references.
One of the premises behind the patch is that (a*)+ can be understood as (?:a*)*(a*). Because of this, the following expressions behave differently:
% regexp -indices -inline {(a*)*} aaa {0 2} {0 2}
% regexp -indices -inline {(a*)+} aaa {0 2} {3 2}
The difference is that in the second expression, (a*) manages to capture nothing because it matches the empty string after aaa.
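The equivalence described above can be checked directly by running both forms against the same input. The following is only a sketch: showIndices is a hypothetical helper name, and the captures it reports depend on the Tcl version and on whether the patch is present, so no expected output is given.

```tcl
# Hypothetical helper: print and return the index pairs reported for a pattern.
proc showIndices {re str} {
    set result [regexp -indices -inline -- $re $str]
    puts [format {%-16s -> %s} $re $result]
    return $result
}

# The quantified form and the expansion assumed by the patch.
set plus     [showIndices {(a*)+} aaa]
set expanded [showIndices {(?:a*)*(a*)} aaa]

# Compare only the capture group (element 1 of each result).
puts "same capture: [expr {[lindex $plus 1] eq [lindex $expanded 1]}]"
```

If the two captures agree, the engine is treating the + closure as the expansion above.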
Identify which of the behaviours listed below are due to this patch.
See also Regexp quantifier issues, pgsql-bugs, 2019-11-22.
Ideal:
% regexp -indices -inline {(a*)(b*?)} aaaabbbb {0 7} {0 3} {4 3}
Actual:
% regexp -indices -inline {(a*)(b*?)} aaaabbbb {0 7} {0 3} {4 7}
Ideal:
% regexp -inline -indices {(t*?)?} ttt {0 -1} {0 -1}
Actual:
% regexp -inline -indices {(t*?)?} ttt {0 2} {0 2}
Ideal:
% regexp -indices -inline {^(a*)+$} aaa {0 2} {0 2}
Actual:
% regexp -indices -inline {^(a*)+$} aaa {0 2} {3 2}
Ideal and actual:
% regexp -indices -inline {.*(a*){1,3}?} aaaa {0 3} {4 3}
If there is a quantifier on a capturing expression, the capture should ideally be reported as a list of matches, one for each iteration:
Ideal:
% regexp -indices -inline {(a.*?f)*} aaafaaafjkl {0 7} {{0 3} {4 7}}
Actual:
% regexp -indices -inline {(a.*?f)*} aaafaaafjkl {0 7} {4 7}
Ideal:
% regexp -indices -inline {(a*[^a])+} aaabbaacaa {0 7} {{0 3} {5 7}}
Actual:
% regexp -indices -inline {(a*[^a])+} aaabbaacaa {0 7} {5 7}
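Until regexp can report one match per iteration, something similar can be approximated in script by applying the inner expression repeatedly. The following is a minimal sketch, not a proposal for the engine: iterationIndices is a hypothetical helper, it assumes the iterations start at the beginning of the string, and it does not reproduce the backtracking a real quantifier would perform. It relies on the documented behaviour that \A matches at the -start offset and that -indices stay relative to the whole string.

```tcl
# Hypothetical helper, not part of Tcl: collect the index pair of each
# successive iteration of $inner, starting at the beginning of $str.
proc iterationIndices {inner str} {
    # \A anchors every attempt at the current -start offset.
    set re [format {\A(?:%s)} $inner]
    set result {}
    set pos 0
    set len [string length $str]
    while {$pos < $len} {
        if {![regexp -indices -start $pos -- $re $str m]} break
        lassign $m first last
        # An empty iteration would never advance, so stop there.
        if {$last < $first} break
        lappend result $m
        set pos [expr {$last + 1}]
    }
    return $result
}

puts [iterationIndices {a.*?f} aaafaaafjkl]   ;# {0 3} {4 7}
puts [iterationIndices {a*[^a]} aaabbaacaa]   ;# {0 3} {4 4} {5 7}
```

Because the sketch reports every iteration, the second call also includes the single-character iteration at index 4.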
Ideal:
% regexp -indices -inline {((a*)+)} aaa {0 2} {0 2} {0 2}
Actual:
% regexp -indices -inline {((a*)+)} aaa {0 2} {0 2} {3 2}
Ideal:
% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc {7 8}
Actual:
% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc {0 8}
Ideally, the greediness of a branch would not affect another branch:
% regexp -indices -inline {z*|(a*?)(r+)} aaaarr {0 4} {0 3} {4 4}
But currently, if the first branch is greedy, all branches are greedy:
% regexp -indices -inline {z*|(a*?)(r+)} aaaarr {0 5} {0 3} {4 5}
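One way to observe the interaction is to run the second branch on its own and compare it with the full alternation. This is only an illustrative sketch: the variable names are hypothetical, and the results may vary between Tcl versions, so none are stated here.

```tcl
# Hypothetical comparison: the second branch alone versus the full alternation.
set branch [regexp -indices -inline -- {(a*?)(r+)} aaaarr]
set both   [regexp -indices -inline -- {z*|(a*?)(r+)} aaaarr]

puts "second branch alone: $branch"
puts "full alternation:    $both"

# Element 0 is the whole match; a difference here shows the first
# branch influencing how far the second branch extends.
puts "whole match differs: [expr {[lindex $branch 0] ne [lindex $both 0]}]"
```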