This page serves as a list of current (June 2003) problems with Tcl's regexp library.
So, what proposals have been submitted as TIPs to address the deficits?
Related question: who understands the code enough to be persuaded to write a TIP to address any of these items. (Since, without an implementor, any TIP would only languish)
[Compliments to Vince for this summary.]
DKF: On the last point (greedy/non-greedy) the rules are clear in a certain sense. That sense is that the first quantifier syntactically encountered in a RE determines whether the RE as a whole is greedy or not, and all quantifiers in the RE are either greedy or not greedy. There are no mixed-greediness REs, which is definitely unlike Perl. On the other hand, I've no idea what the automata-theoretic basis for a mixed-greediness RE is; the Perl code is widely accused (probably rightly) of being a horrible hack.
JE: Don't expect an automata-theoretic basis for mixed-greediness regexps. A finite state automaton either reaches an accepting state or it doesn't; there is no record of how the automaton got there. So if you're concerned about which subexpressions of an RE matched what substrings of the input, automata theory is no help.
(With a DFA, it's easy to find the longest or shortest match for a regular expression as a whole -- halt as soon as you reach a final state for the shortest match, halt when you reach the dead state for the longest one -- but when it comes to mixed-greediness regexps, a different theoretical basis is needed. I don't know what that theoretical basis might be; if anyone knows of one I'd be very interested to hear about it.)
Back to DKF:
I should note here that there is an exception to the rule on quantifier mixing, and that is for (non-zero-width) lookahead assertions, which are always non-greedy. But they're handled effectively by recursively calling the RE engine internally, and the boundary between which automata are being used is fairly obvious. They're also not very efficient, and you can't use them for subexpression capture.
Vince notes that DKF sounds suspiciously as if he understands the code and could therefore be someone to address some of these bugs...
Vince: Unfortunately, it's very hard to submit a TIP to address any of the above issues (except the addition of a function to the stubs table, which I could happily submit a TIP for), because the code is so incomprehensible that I don't know who (beyond Henry Spencer) would actually be in a position to implement any changes. That's partly why I put together the above list, to see whether discussion here would point to a way forward/out of this mess ;-).
Joe Mistachkin: After analyzing and attempting to isolate some of the existing bugs in the current regexp engine for countless hours, I agree with Vince on all points.
DKF: Me too. While I understand a lot of the theoretical basis for RE matching (producing a non-deterministic automata from an RE isn't exactly hard compared with most compilation tasks) Henry's code is not at all for the faint-hearted. It's not designed to be read by anyone other than the author, given the number of single-character variable names and interesting C-preprocessor tricks. If I had a few months uninterrupted effort available, I'm sure I could grasp what's going on, but I've not had that much totally free time since 1994...
Vince: Sounds like Henry's the only person who'll be able to help us, then. How about we collect together all of the known problems on this page (Joe alludes to some bugs, but I'm not sure what they are), and then see if someone on the TCT can get in touch with Henry...
One alternative for backward matching: reverse the string you're matching against, and construct a regular expression designed to work on a reversed string.
One alternative for replacing the existing system: implement a regexp engine in pure tcl (with someone with a good grounding in the theory -- perhaps a college professor who teaches theory of computation -- providing a strong influence) then start replacing the most inefficient pieces with compiled code. Perhaps implement several engines, for perspective (for example, both a NFA and a DFA).
Note that perl regular expressions provide a superset of regular expression functionality. Translated out of perlspeak, this means that when some of the notation is used the expressions are no longer regular. I'm not even certain they're context free. It might very well be that the replacement system doesn't have quite the same semantics as the current system.