Regexp engine cleanup

This is one of the GSoC 2009 Projects.


The regexp engine of the Tcl language should be improved in the following ways:

  • Fixed character width: The code assumes the text being searched is a C vector of chrs, i.e., the byte width of characters is fixed (2 bytes in normal builds), even though regular expression matching in principle needs only sequential access. One disadvantage is that data to be matched frequently has to be converted from Tcl's primary UTF-8 string representation to a fixed-width representation. Another is that this blocks some approaches to extending Tcl's Unicode support to characters beyond the BMP. Therefore, the width of a character should be dynamic.
  • Convert to Tcl coding style: When the regexp engine was originally incorporated into the Tcl core, further upstream development was expected, so it was accepted even though it did not adhere to the Tcl Style Guide and was less readable than the Tcl core in general. Today there is no upstream development, so it should be brought in line with the rest of the core.
  • Implement stream interface: Make it possible to run the engine on a stream of characters delivered by a callback.
  • Implement lookbehind constraints: The engine supports lookahead constraints (?=...), but not lookbehind constraints (?<=...). It should support both.
  • Implement reversion: The reverse of a regular language is also a regular language, so there is a theoretical foundation for an RE syntax or option meaning "this regexp is to be read backwards". Reversion has a practical application in backwards searches.
  • Improved performance and/or memory usage: To be specified
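
The claim behind the reversion item can be illustrated with a small sketch. The Python code below is not part of the Tcl engine and says nothing about how the project would implement the feature. It uses a hypothetical tuple-based syntax tree (the node names lit, cat, alt and star are assumptions made for this example) and shows that reversing a regular expression amounts to reversing the order of every concatenation while leaving alternation and repetition in place.

```python
import re

# Hypothetical mini syntax tree, for illustration only:
#   ("lit", "a")            a literal character
#   ("cat", [n1, n2, ...])  concatenation
#   ("alt", [n1, n2, ...])  alternation
#   ("star", n)             zero or more repetitions


def reverse(node):
    """Return a tree matching exactly the reversed words of `node`."""
    kind = node[0]
    if kind == "lit":
        return node
    if kind == "cat":
        # Only concatenation changes: reverse the order of the parts.
        return ("cat", [reverse(n) for n in reversed(node[1])])
    if kind == "alt":
        return ("alt", [reverse(n) for n in node[1]])
    if kind == "star":
        return ("star", reverse(node[1]))
    raise ValueError("unknown node kind: %r" % (kind,))


def to_pattern(node):
    """Render a tree as a pattern string for Python's re module."""
    kind = node[0]
    if kind == "lit":
        return re.escape(node[1])
    if kind == "cat":
        return "".join(to_pattern(n) for n in node[1])
    if kind == "alt":
        return "(?:" + "|".join(to_pattern(n) for n in node[1]) + ")"
    if kind == "star":
        return "(?:" + to_pattern(node[1]) + ")*"
    raise ValueError("unknown node kind: %r" % (kind,))


if __name__ == "__main__":
    # ab*(c|d)
    tree = ("cat", [("lit", "a"),
                    ("star", ("lit", "b")),
                    ("alt", [("lit", "c"), ("lit", "d")])])
    print(to_pattern(tree))           # a(?:b)*(?:c|d)
    print(to_pattern(reverse(tree)))  # (?:c|d)(?:b)*a
```

A word matches the original pattern exactly when the reversed word matches the reversed pattern, which is the property a backwards search relies on. Constructs outside the purely regular core, such as back references or lookahead constraints, are not covered by this sketch.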

Schedule:

Start date (May 23)

May 30 - Getting used to the code/Rewrite the code to use Tcl's coding style

June 6 - Getting used to the code/Rewrite the code to use Tcl's coding style

June 13 - Change "Fixed character width is assumed"

June 20 - Implement stream interface

June 27 - Implement lookbehind constraints

July 4 - Implement regexp reversion

July 11 - Improve performance and/or memory usage

July 18 - Improve performance and/or memory usage

July 25 - Improve performance and/or memory usage

August 1 - Improve performance and/or memory usage

August 8 - Improve performance and/or memory usage

End date (August 17)

Project blog:

TCL regex engine cleanup blog (AMG: 404)

Patches:

Commenting and code style patch

About me:

Daniel Klöck's portfolio

Discussion:

Is there any reason not to use PCRE? - Larry Smith

There is already a discussion running (see [L1 ]). - Daniel Klöck

JH: I have done the PCRE integration work already; see [L2 ], which is a functional patch (it needs to be updated for a couple of known minor bugs). PCRE is an interesting option, but not an end-all-be-all, as it isn't 100% compatible with the current Spencer RE.

One performance issue to note is non-greedy matching where submatch data must be returned, e.g. [L3 ]