This page serves as a list of current (June 2003) problems with Tcl's regexp library.

   * Bugs in implementation of TCL_REG_CANMATCH flag, which mean the library believes '.x' ''could match'' 'abc' at position zero (which is plain wrong!).  The correct answer is that it ''could match'' at position 2, if more text were available.

   * The implementation is completely incomprehensible and thereby totally at odds with the rest of Tcl's wonderfully readable source code.  Even trying to work out what a function like 'find' or 'cfind' is supposed to do (let alone what it actually does) is practically impossible.

   * No backwards matching support.  The only way to find matches backwards is to iteratively search forwards until you find the last match.

   * Tcl doesn't expose any of the low-level regexp functions in its stubs table.  This means any Tcl extension which wants to run a regexp over a buffer must first place that buffer into a Tcl_Obj (with Tcl_NewUnicodeObj, most likely).  This is not a satisfactory solution if the buffer is large and already in Unicode format, since placing it in a Tcl_Obj will result in it being copied.  Tcl should therefore expose ''RegExpExecUniChar'' (from tclRegexp.c) in its stubs table.

   * The library has very good properties for pure-greedy or pure-non-greedy matching (not shared by Perl, for example, which will not always return the longest/shortest match respectively), but the rules for mixed-greediness matching are very, very unclear.

So, what proposals have been submitted as [TIP]s to address the deficits?

Related question: who understands the code enough to be persuaded to write a TIP to address any of these items.  (Since, without an implementor, any TIP would only languish)

----
[[Compliments to Vince for this summary.]]


----
[DKF]: On the last point (greedy/non-greedy) the rules are clear in a certain sense.  That sense is that the first quantifier syntactically encountered in a RE determines whether the RE as a whole is greedy or not, and all quantifiers in the RE are either greedy or not greedy.  There are no mixed-greediness REs, which is definitely unlike Perl.  On the other hand, I've no idea what the automata-theoretic basis for a mixed-greediness RE is; the Perl code is widely accused (probably rightly) of being a horrible hack.

I should note here that there is an exception to the rule on quantifier mixing, and that is for (non-zero-width) lookahead assertions, which are always non-greedy.  But they're handled effectively by recursively calling the RE engine internally, and the boundary between which automata are being used is fairly obvious.  They're also not very efficient, and you can't use them for subexpression capture.
----

[Vince]: Unfortunately, it's very hard to submit a [TIP] to address any of the above issues (except the addition of a function to the stubs table, which I could happily submit a TIP for), because the code is so incomprehensible that I don't know who (beyond Henry Spencer) would actually be in a position to implement any changes.  That's partly why I put together the above list, to see whether discussion here would point to a way forward/out of this mess ;-).

----

[Joe Mistachkin]: After analyzing and attempting to isolate some of the existing bugs in the current regexp engine for countless hours, I agree with Vince on all points.

----

One alternative for backward matching: reverse the string you're matching against, and construct a regular expression designed to work on a reversed string.

One alternative for replacing the existing system: implement a regexp engine in pure tcl (with someone with a good grounding in the theory -- perhaps a college professor who teaches theory of computation -- providing a strong influence) then start replacing the most inefficient pieces with compiled code.  Perhaps implement several engines, for perspective (for example, both a NFA and a DFA).

Note that perl regular expressions provide a superset of regular expression functionality.  Translated out of perlspeak, this means that when some of the notation is used the expressions are no longer regular.  I'm not even certain they're context free.  It might very well be that the replacement system doesn't have quite the same semantics as the current system.

----
[Category Internals]