Version 17 of Drawbacks of Tcl's Regexps

This page serves as a list of current (June 2003) problems with Tcl's regexp library.

Bugs in implementation of TCL_REG_CANMATCH flag, which mean the library believes '.x' could match 'abc' at position zero (which is plain wrong!). The correct answer is that it could match at position 2, if more text were available.
The implementation is completely incomprehensible and thereby totally at odds with the rest of Tcl's wonderfully readable source code. Even trying to work out what a function like 'find' or 'cfind' is supposed to do (let alone what it actually does) is practically impossible.
No backwards matching support. The only way to find matches backwards is to iteratively search forwards until you find the last match.
Tcl doesn't expose any of the low-level regexp functions in its stubs table. This means any Tcl extension which wants to run a regexp over a buffer must first place that buffer into a Tcl_Obj (with Tcl_NewUnicodeObj, most likely). This is not a satisfactory solution if the buffer is large and already in Unicode format, since placing it in a Tcl_Obj will result in it being copied. Tcl should therefore expose RegExpExecUniChar (from tclRegexp.c) in its stubs table.
The library has very good properties for pure-greedy or pure-non-greedy matching (not shared by Perl, for example, which will not always return the longest/shortest match respectively), but the rules for mixed-greediness matching are very, very unclear.
"Can't Happen" bugs, e.g. [L1 ]

So, what proposals have been submitted as TIPs to address the deficits?

Related question: who understands the code enough to be persuaded to write a TIP to address any of these items. (Since, without an implementor, any TIP would only languish)

[Compliments to Vince for this summary.]

DEL: I would restate the Unicode issue slightly differently. First, I don't think that it is any big deal that objects have to be placed into a Tcl_Obj. First because this is pretty typical throughout Tcl and second, because you can still manipulate the data in the object directly. So the overhead of setting up a Tcl_Obj ONCE is not really that big a deal. If you compare this to some other things in Tcl that also only take Tcl_Objs, regexp looks like a pretty good example for the interface. Expect, for example, lets the regexp engine run over the data as it streams and the overhead is not a problem as the Tcl_Obj creation is only done once.

That said, a real negative that I would like to bring out is that Regexp engine does not support UTF8 but requires Unicode! This can be painful because Unicode is not necessarily a sensible choice for some apps. So this choice causes shimmering in these apps. I can understand why Henry choose to support only Unicode - offsets simply aren't efficient in UTF8. I don't know what a better solution is - just because I have a complaint with the existing implementation doesn't mean it isn't the best solution :P but I'm adding it here for the record anyway.

DKF: On the last point (greedy/non-greedy) the rules are clear in a certain sense. That sense is that the first quantifier syntactically encountered in a RE determines whether the RE as a whole is greedy or not, and all quantifiers in the RE are either greedy or not greedy. There are no mixed-greediness REs, which is definitely unlike Perl. On the other hand, I've no idea what the automata-theoretic basis for a mixed-greediness RE is; the Perl code is widely accused (probably rightly) of being a horrible hack.

JE: Don't expect an automata-theoretic basis for mixed-greediness regexps. A finite state automaton either reaches an accepting state or it doesn't; there is no record of how the automaton got there. So if you're concerned about which subexpressions of an RE matched what substrings of the input, automata theory is no help.

(With a DFA, it's easy to find the longest or shortest match for a regular expression as a whole -- halt as soon as you reach a final state for the shortest match, halt when you reach the dead state for the longest one -- but when it comes to mixed-greediness regexps, a different theoretical basis is needed. I don't know what that theoretical basis might be; if anyone knows of one I'd be very interested to hear about it.)

Back to DKF:

I should note here that there is an exception to the rule on quantifier mixing, and that is for (non-zero-width) lookahead assertions, which are always non-greedy. But they're handled effectively by recursively calling the RE engine internally, and the boundary between which automata are being used is fairly obvious. They're also not very efficient, and you can't use them for subexpression capture.

Vince notes that DKF sounds suspiciously as if he understands the code and could therefore be someone to address some of these bugs...

Vince: Unfortunately, it's very hard to submit a TIP to address any of the above issues (except the addition of a function to the stubs table, which I could happily submit a TIP for), because the code is so incomprehensible that I don't know who (beyond Henry Spencer) would actually be in a position to implement any changes. That's partly why I put together the above list, to see whether discussion here would point to a way forward/out of this mess ;-).

Joe Mistachkin: After analyzing and attempting to isolate some of the existing bugs in the current regexp engine for countless hours, I agree with Vince on all points.

DKF: Me too. While I understand a lot of the theoretical basis for RE matching (producing a non-deterministic automata from an RE isn't exactly hard compared with most compilation tasks) Henry's code is not at all for the faint-hearted. It's not designed to be read by anyone other than the author, given the number of single-character variable names and interesting C-preprocessor tricks. If I had a few months uninterrupted effort available, I'm sure I could grasp what's going on, but I've not had that much totally free time since 1994...

Vince: Sounds like Henry's the only person who'll be able to help us, then. How about we collect together all of the known problems on this page (Joe alludes to some bugs, but I'm not sure what they are), and then see if someone on the TCT can get in touch with Henry...

One alternative for backward matching: reverse the string you're matching against, and construct a regular expression designed to work on a reversed string.

One alternative for replacing the existing system: implement a regexp engine in pure tcl (with someone with a good grounding in the theory -- perhaps a college professor who teaches theory of computation -- providing a strong influence) then start replacing the most inefficient pieces with compiled code. Perhaps implement several engines, for perspective (for example, both a NFA and a DFA).

Note that perl regular expressions provide a superset of regular expression functionality. Translated out of perlspeak, this means that when some of the notation is used the expressions are no longer regular. I'm not even certain they're context free. It might very well be that the replacement system doesn't have quite the same semantics as the current system.

CMcC: Looking at rxspencer-alpha3.8.g3.tar.gz here [L2 ] I see a note from (I presume) Henry Spencer:

    New in alpha3.5:  Active development of this code has been stopped
    -- I'm working on a complete reimplementation

I'm guessing that what we have in tcl is that reimplementation, since the code is very different.

Category Internals