Regular Expressions Match Requirements

'''Tcl Regular Expression Match Requirements''' describes a set of [regular
expressions] and what, ideally, each of them would match.


** See Also **

   issue: [https://core.tcl-lang.org/tcl/info/0e0e150e49%|%exact match is greedy by reluctant exact match quantifier]:   



** Description **

Tcl's regular expression engine has a particular design that leads to some
unexpected results.  This page collects regular expressions, the results they
currently produce, and the results they would ideally produce, as a guide for
the further development of regular expression routines in Tcl.



** To Do **

On 2015-09-21 [https://core.tcl-lang.org/tcl/info/c8dfe06653dbef5d%|%a
significant patch] to `regcomp.c` was contributed by the [postgresql] project.
See  [https://core.tcl-lang.org/tcl/info/1115587%|%Regexp backreference fail
with a * closure] and [https://core.tcl-lang.org/tcl/info/0e0e150e49%|%Fix for
quantified regexp back-references].

One of the premises behind the patch is that

======none
(a*)+
======

can be understood as

======none
(?:a*)*(a*)
======

Due to this, the following expressions behave differently:


======none
% regexp -indices -inline {(a*)*} aaa
{0 2} {0 2}
% regexp -indices -inline {(a*)+} aaa
{0 2} {3 2}
======

The difference is that in the second expression the last iteration of `(a*)`
matches the empty string after `aaa`, so the reported capture is the empty
range `{3 2}`.
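
An index pair whose end is one less than its start, such as `{3 2}`, denotes an empty match at that position.  The following sketch makes this visible by printing each index pair next to the substring it denotes.  `rangesToStrings` and `showMatch` are names made up for this page, not part of Tcl, and the output of the loop depends on the Tcl build in use.

======
# Hypothetical helper: map each {first last} pair to the substring it denotes.
# An empty match is reported as {n n-1}, for which [string range] returns "".
proc rangesToStrings {string ranges} {
    set result {}
    foreach range $ranges {
        lassign $range first last
        lappend result [string range $string $first $last]
    }
    return $result
}

# Hypothetical helper: run a pattern and print the indices next to the text.
proc showMatch {pattern string} {
    set ranges [regexp -indices -inline -- $pattern $string]
    puts [format {%-14s %-16s %s} $pattern $ranges \
        [rangesToStrings $string $ranges]]
}

# Compare the two expressions above with the rewritten form of (a*)+.
foreach pattern {(a*)* (a*)+ (?:a*)*(a*)} {
    showMatch $pattern aaa
}
======

For example, `rangesToStrings aaa {{0 2} {3 2}}` returns `aaa {}`: the whole match is `aaa` and the capture is empty.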

Identify which of the behaviours listed below are due to this patch. 

See also, [https://www.postgresql.org/message-id/16133-a8934caee4e53035%40postgresql.org%|%Regexp quantifier issues], pgsql-bugs, 2019-11-22.
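
One way to approach this item is to run the cases from this page under Tcl builds from before and after the patch and compare the reports.  The sketch below is only an assumption about how such a harness could look: the `cases` table holds a few of the expressions listed further down, and `checkCases` is a hypothetical helper.  It records what the running interpreter returns and does not assume any particular result.

======
# Hypothetical table: pattern, subject string, and ideal result, copied from
# the sections of this page.
set cases {
    {(t*?)?}        ttt   {{0 -1} {0 -1}}
    {^(a*)+$}       aaa   {{0 2} {0 2}}
    {((a*)+)}       aaa   {{0 2} {0 2} {0 2}}
    {.*(a*){1,3}?}  aaaa  {{0 3} {4 3}}
}

# Hypothetical helper: run every case and return one dict per case.
proc checkCases {cases} {
    set report {}
    foreach {pattern subject ideal} $cases {
        set actual [regexp -indices -inline -- $pattern $subject]
        lappend report [dict create pattern $pattern subject $subject \
            ideal $ideal actual $actual same [expr {$actual eq $ideal}]]
    }
    return $report
}

# Print the interpreter version first, so that reports can be compared.
puts "Tcl [info patchlevel]"
foreach entry [checkCases $cases] {
    dict with entry {
        puts [format {%-16s %-8s %s} $pattern \
            [expr {$same ? "ideal" : "differs"}] $actual]
    }
}
======

A case whose report changes between a build without the patch and a build with it is a candidate for being caused by the patch.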




** `(a*)(b*?)` **

Ideal: 

======none
% regexp -indices -inline {(a*)(b*?)} aaaabbbb
{0 3} {0 3} {4 3}
======


Actual:

======none
% regexp -indices -inline {(a*)(b*?)} aaaabbbb
{0 7} {0 3} {4 7}
======



** `(t*?)?` **

Ideal:

======none
% regexp -inline -indices {(t*?)?} ttt
{0 -1} {0 -1}
======

Actual:

======none
% regexp -inline -indices {(t*?)?} ttt
{0 2} {0 2}
======



** `^(a*)+$` **

Ideal:

======none
% regexp -indices -inline {^(a*)+$} aaa
{0 2} {0 2}
======

Actual: 

======none
% regexp -indices -inline {^(a*)+$} aaa
{0 2} {3 2}
======



** `.*(a*){1,3}?` **

Ideal and actual:

======none
% regexp -indices -inline {.*(a*){1,3}?} aaaa
{0 3} {4 3}
======



** `(a.*?f)*` **

A quantified capturing group should ideally report a list of matches, one per
iteration, rather than only the last one:

Ideal: 

======none
% regexp -indices -inline {(a.*?f)*} aaafaaafjkl
{0 7} {{0 3} {4 7}}
======

Actual:

======none
% regexp -indices -inline {(a.*?f)*} aaafaaafjkl
{0 7} {4 7}
======

Ideal: 

======none
% regexp -indices -inline {(a*[^a])+} aaabbaacaa
{0 7} {{0 3} {4 4} {5 7}}
======

Actual:

======none
% regexp -indices -inline {(a*[^a])+} aaabbaacaa
{0 7} {5 7}
======
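
Until `regexp` can report every iteration of a quantified group, the per-iteration ranges can be approximated in script.  The sketch below matches the group body repeatedly, each time anchored with `\A` at the position where the previous iteration ended (with `-start`, `\A` matches at the start index).  `iterations` is a hypothetical helper, and it rests on two assumptions: the body contains no top-level alternation, and nothing following the group in the full expression would force the engine to backtrack into it.

======
# Hypothetical helper: return the {first last} pair of each iteration of a
# repeated group whose body is $body, scanning $string from index 0.
proc iterations {body string} {
    set result {}
    set pos 0
    # \A anchors each attempt at $pos; indices are relative to the whole string.
    while {[regexp -indices -start $pos -- "\\A$body" $string range]} {
        lassign $range first last
        if {$last < $first} break   ;# empty match: stop rather than loop forever
        lappend result $range
        set pos [expr {$last + 1}]
    }
    return $result
}

puts [iterations {a.*?f} aaafaaafjkl]
puts [iterations {a*[^a]} aaabbaacaa]
======

For the two subjects used in this section, this should print `{0 3} {4 7}` and `{0 3} {4 4} {5 7}`, the latter including the iteration that consumes the single `b` at index 4.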



** `((a*)+)` **

Ideal:

======none
% regexp -indices -inline {((a*)+)} aaa
{0 2} {0 2} {0 2}
======

Actual: 

======none
% regexp -indices -inline {((a*)+)} aaa
{0 2} {0 2} {3 2}
======



** `(?:a*b)+c` **

Ideal:

======none
% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc
{7 8}
======


Actual:

======none
% regexp -indices -inline {(?:a*b)+c} aaaabbbbcc
{0 8}
======



** If the First Branch is Greedy all Branches are Greedy **


Ideally, the greediness of a branch would not affect another branch:

======none
% regexp -indices -inline {z*|(a*?)(r+)} aaaarr
{0 4} {0 3} {4 4}
======

But currently, if the first branch is greedy, all branches are greedy:

======none
% regexp -indices -inline {z*|(a*?)(r+)} aaaarr
{0 5} {0 3} {4 5}
======



** Page Authors **

   [pyk]:   






<<categories>> regular expressions