Regular Expressions Match Requirements

Tcl Regular Expression Match Requirements describes a set of regular expressions, and what, ideally, they would match.

See Also

issue: exact match is greedy by reluctant exact match quantifier

Description

Tcl's regular expression engine has a particular design that leads to some unexpected results. The regular expressions collected on this page, together with their ideal results, are intended as a guide for the further development of regular expression routines in Tcl.

To Do

On 2015-09-21, a significant patch to regcomp.c was contributed by the PostgreSQL project. See Regexp backreference fail with a * closure and Fix for quantified regexp back-references.

One of the premises behind the patch is that

(a*)+

can be understood as

(?:a*)*(a*)

Because of this, the following expressions behave differently:

% regexp -indices -inline {(a*)*} aaa
{0 2} {0 2}
% regexp -indices -inline {(a*)+} aaa
{0 2} {3 2}

The difference is that in the second expression, (a*) manages to capture nothing because it matches the empty string after aaa.
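
An index pair such as {3 2} can look odd at first: with -indices, an empty match at offset n is reported as {n n-1}. The following is a minimal sketch of a helper for decoding such pairs. The name span_text is hypothetical and introduced here only for illustration; it is not part of Tcl.

```tcl
# Hypothetical helper (not part of Tcl): turn an index pair reported by
# [regexp -indices] into the text it denotes.
proc span_text {str span} {
    lassign $span first last
    # [string range] returns the empty string when last < first, which
    # covers empty matches such as {3 2}.
    return [string range $str $first $last]
}

puts "<[span_text aaa {0 2}]>"   ;# prints <aaa>
puts "<[span_text aaa {3 2}]>"   ;# prints <>
```

Decoding the pairs this way makes it easier to see that the last iteration of (a*) matched the empty string at the end of the input rather than the run of a characters.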

Identify which of the behaviours listed below are due to this patch.

See also Regexp quantifier issues, pgsql-bugs, 2019-11-22.
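
One way to approach the task of identifying which behaviours are due to the patch is to run the same set of cases under a Tcl build from before the patch and under one from after it, then compare the reports. The following is only a sketch of such a harness: the names cases and check_cases are hypothetical, and the cases are copied from the sections of this page.

```tcl
# Hypothetical harness: the list holds triples of pattern, input and ideal
# result. The ideal values are copied from this page.
set cases {
    {(a*)(b*?)}    aaaabbbb   {{0 7} {0 3} {4 3}}
    {(t*?)?}       ttt        {{0 -1} {0 -1}}
    {^(a*)+$}      aaa        {{0 2} {0 2}}
    {((a*)+)}      aaa        {{0 2} {0 2} {0 2}}
    {(?:a*b)+c}    aaaabbbbcc {{7 8}}
}

proc check_cases {cases} {
    set report {}
    foreach {pattern input ideal} $cases {
        set actual [regexp -indices -inline -- $pattern $input]
        # A string comparison is enough here because the ideal values
        # above are written in canonical list form.
        set status [expr {$actual eq $ideal ? "ideal" : "differs"}]
        lappend report [list $pattern $status $actual]
    }
    return $report
}

foreach row [check_cases $cases] {
    lassign $row pattern status actual
    puts [format "%-14s %-8s %s" $pattern $status $actual]
}
# Record which interpreter produced the report.
puts "Tcl [info patchlevel]"
```

A case whose status changes between the two builds is a candidate for a behaviour introduced or fixed by the patch.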

(a*)(b*?)

Ideal:

% regexp -indices -inline {(a*)(b*?)} aaaabbbb
{0 7} {0 3} {4 3}

Actual:

% regexp -indices -inline {(a*)(b*?)} aaaabbbb
{0 7} {0 3} {4 7}

(t*?)?

Ideal:

% regexp -inline -indices {(t*?)?} ttt
{0 -1} {0 -1}

Actual:

% regexp -inline -indices {(t*?)?} ttt
{0 2} {0 2}

^(a*)+$

Ideal:

% regexp -indices -inline {^(a*)+$} aaa
{0 2} {0 2}

Actual:

% regexp -indices -inline {^(a*)+$} aaa
{0 2} {3 2}

.*(a*){1,3}?

Ideal and actual:

% regexp -indices -inline {.*(a*){1,3}?} aaaa
{0 3} {4 3}

(a.*?f)*

A capturing subexpression that carries a quantifier should ideally return a list of matches, one per iteration:

Ideal:

% regexp -indices -inline {(a.*?f)*} aaafaaafjkl
{0 7} {{0 3} {4 7}}

Actual:

% regexp -indices -inline {(a.*?f)*} aaafaaafjkl
{0 7} {4 7}

Ideal:

% regexp -indices -inline {(a*[^a])+} aaabbaacaa
{0 7} {{0 3} {5 7}}

Actual:

% regexp -indices -inline {(a*[^a])+} aaabbaacaa
{0 7} {5 7}
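
Until regexp can report one span per iteration, something similar can be approximated by applying the inner pattern on its own with -all. This is a minimal sketch, not a replacement for the ideal behaviour: the name iteration_spans is hypothetical, the inner pattern is assumed to contain no capturing groups of its own, and the matches found are successive non-overlapping matches anywhere in the string, which are not guaranteed to be the contiguous iterations of the quantified group.

```tcl
# Hypothetical helper (not part of Tcl): collect the span of every
# successive, non-overlapping match of an inner pattern.
# Assumes $inner has no capturing groups; otherwise -all -inline would
# also return the spans of those groups.
proc iteration_spans {inner str} {
    return [regexp -all -indices -inline -- $inner $str]
}

puts [iteration_spans {a.*?f} aaafaaafjkl]   ;# prints {0 3} {4 7}
```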

((a*)+)

Ideal:

% regexp -indices -inline {((a*)+)} aaa
{0 2} {0 2} {0 2}

Actual:

% regexp -indices -inline {((a*)+)} aaa
{0 2} {0 2} {3 2}

(?:a*b)+c

Ideal:

% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc
{7 8}

Actual:

% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc
{0 8}

If the First Branch is Greedy all Branches are Greedy

Ideally, the greediness of a branch would not affect another branch:

% regexp -indices -inline {z*|(a*?)(r+)} aaaarr
{0 4} {0 3} {4 4}

But currently, if the first branch is greedy, all branches are greedy:

% regexp -indices -inline {z*|(a*?)(r+)} aaaarr
{0 5} {0 3} {4 5}
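
To see what each branch prefers in isolation, the branches can be applied separately to the same input and the results compared with the result of the full alternation. The following is a small sketch; branch_results is a hypothetical name, and the branches are passed as a list because splitting a pattern on | by hand is not reliable in general.

```tcl
# Hypothetical helper (not part of Tcl): apply each branch on its own and
# return a flat list of branch/result pairs.
proc branch_results {branches str} {
    set results {}
    foreach branch $branches {
        lappend results $branch [regexp -indices -inline -- $branch $str]
    }
    return $results
}

foreach {branch result} [branch_results {{z*} {(a*?)(r+)}} aaaarr] {
    puts [format "%-12s %s" $branch $result]
}
```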

Page Authors

pyk