This page serves as a list of current (June 2003) problems with Tcl's regexp library.
So, what proposals have been submitted as TIPs to address the deficits?
Related question: who understands the code enough to be persuaded to write a TIP to address any of these items. (Since, without an implementor, any TIP would only languish)
[Compliments to Vince for this summary.]
DEL: I would restate the Unicode issue slightly differently. First, I don't think that it is any big deal that objects have to be placed into a Tcl_Obj. First because this is pretty typical throughout Tcl and second, because you can still manipulate the data in the object directly. So the overhead of setting up a Tcl_Obj ONCE is not really that big a deal. If you compare this to some other things in Tcl that also only take Tcl_Objs, regexp looks like a pretty good example for the interface. Expect, for example, lets the regexp engine run over the data as it streams and the overhead is not a problem as the Tcl_Obj creation is only done once.
That said, a real negative that I would like to bring out is that Regexp engine does not support UTF8 but requires Unicode! This can be painful because Unicode is not necessarily a sensible choice for some apps. So this choice causes shimmering in these apps. I can understand why Henry choose to support only Unicode - offsets simply aren't efficient in UTF8. I don't know what a better solution is - just because I have a complaint with the existing implementation doesn't mean it isn't the best solution :P but I'm adding it here for the record anyway.
DKF: On the last point (greedy/non-greedy) the rules are clear in a certain sense. That sense is that the first quantifier syntactically encountered in a RE determines whether the RE as a whole is greedy or not, and all quantifiers in the RE are either greedy or not greedy. There are no mixed-greediness REs, which is definitely unlike Perl. On the other hand, I've no idea what the automata-theoretic basis for a mixed-greediness RE is; the Perl code is widely accused (probably rightly) of being a horrible hack.
JE: Don't expect an automata-theoretic basis for mixed-greediness regexps. A finite state automaton either reaches an accepting state or it doesn't; there is no record of how the automaton got there. So if you're concerned about which subexpressions of an RE matched what substrings of the input, automata theory is no help.
(With a DFA, it's easy to find the longest or shortest match for a regular expression as a whole -- halt as soon as you reach a final state for the shortest match, halt when you reach the dead state for the longest one -- but when it comes to mixed-greediness regexps, a different theoretical basis is needed. I don't know what that theoretical basis might be; if anyone knows of one I'd be very interested to hear about it.)
Back to DKF:
I should note here that there is an exception to the rule on quantifier mixing, and that is for (non-zero-width) lookahead assertions, which are always non-greedy. But they're handled effectively by recursively calling the RE engine internally, and the boundary between which automata are being used is fairly obvious. They're also not very efficient, and you can't use them for subexpression capture.
Vince notes that DKF sounds suspiciously as if he understands the code and could therefore be someone to address some of these bugs...
Vince: Unfortunately, it's very hard to submit a TIP to address any of the above issues (except the addition of a function to the stubs table, which I could happily submit a TIP for), because the code is so incomprehensible that I don't know who (beyond Henry Spencer) would actually be in a position to implement any changes. That's partly why I put together the above list, to see whether discussion here would point to a way forward/out of this mess ;-).
Joe Mistachkin: After analyzing and attempting to isolate some of the existing bugs in the current regexp engine for countless hours, I agree with Vince on all points.
DKF: Me too. While I understand a lot of the theoretical basis for RE matching (producing a non-deterministic automata from an RE isn't exactly hard compared with most compilation tasks) Henry's code is not at all for the faint-hearted. It's not designed to be read by anyone other than the author, given the number of single-character variable names and interesting C-preprocessor tricks. If I had a few months uninterrupted effort available, I'm sure I could grasp what's going on, but I've not had that much totally free time since 1994...
Vince: Sounds like Henry's the only person who'll be able to help us, then. How about we collect together all of the known problems on this page (Joe alludes to some bugs, but I'm not sure what they are), and then see if someone on the TCT can get in touch with Henry...
One alternative for backward matching: reverse the string you're matching against, and construct a regular expression designed to work on a reversed string.
One alternative for replacing the existing system: implement a regexp engine in pure tcl (with someone with a good grounding in the theory -- perhaps a college professor who teaches theory of computation -- providing a strong influence) then start replacing the most inefficient pieces with compiled code. Perhaps implement several engines, for perspective (for example, both a NFA and a DFA).
Note that perl regular expressions provide a superset of regular expression functionality. Translated out of perlspeak, this means that when some of the notation is used the expressions are no longer regular. I'm not even certain they're context free. It might very well be that the replacement system doesn't have quite the same semantics as the current system.
CMcC: Looking at rxspencer-alpha3.8.g3.tar.gz here [L2 ] I see a note from (I presume) Henry Spencer:
New in alpha3.5: Active development of this code has been stopped -- I'm working on a complete reimplementation
I'm guessing that what we have in tcl is that reimplementation, since the code is very different.