Perl-Compatible Regular Expressions. A superset of Regular Expressions with a few extra features introduced by Perl, less a couple of features that could not be enforced without Perl itself. Welcomed by many, hated by many others.
Note that, despite the name, PCRE are not confined to Perl. Other programs and languages, such as PHP, implement them too.
JMN 2023 For me - the most notable missing feature in Tcl's regex engine is the \p feature for Unicode Properties, Scripts and Blocks.
E.g. it allows \p{Sc} or \p{Currency_Symbol}, which is aware that not all such symbols are in the Unicode Currency_Symbols block.
You can also do things like: \p{InHiragana}
The equivalent with Tcl's engine would be something like:
regexp {[\u3040-\u309F]} $somechar
There are many more regexes that do more than look at a single character range, e.g. testing whether a character is a letter in any language and/or of a particular case. Some of them can be reduced to a simple range such as the Hiragana one above, but to do all of this properly the tables would ideally be maintained in the RE engine or in some loadable package.
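Where the engine has no \p, the property test can be moved out of the pattern. A minimal sketch in Python, chosen here only because its standard library ships the Unicode character database; the helper names are made up for illustration and the functions are rough stand-ins, not exact equivalents of the PCRE properties:

```python
import re
import unicodedata

# Hiragana block U+3040..U+309F written as an explicit range,
# the same workaround as the Tcl pattern above.
HIRAGANA = re.compile(r'[\u3040-\u309F]')

def is_currency_symbol(ch):
    # Rough stand-in for \p{Sc}: general category "Sc".
    return unicodedata.category(ch) == 'Sc'

def is_letter(ch):
    # Rough stand-in for \p{L}: any general category starting with "L".
    return unicodedata.category(ch).startswith('L')

def is_hiragana(ch):
    # Rough stand-in for \p{InHiragana}: membership in the block range.
    return HIRAGANA.fullmatch(ch) is not None
```

The category lookups follow the Unicode database bundled with the interpreter, so results for recently added characters depend on the Python version.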
DKF - Wow. There's more Features from the Black Lagoon in there than you can shake a B-Movie at...
A big Regular Expression fan sees DKF's remark and says: Maybe Tcl uses ARE because Tcl is so merciful. More features for the bold and less features for the queasy.
Another (OK, the same) big Regular Expression fan also says: Regular Expressions are not very easy, granted, but they're also overly mystified, and PCRE take the blame for some extra mystification, even by those who are good friends with Regular Expressions.
The thing is that PCRE allow tricks that are impossible with traditional RE. Some people advocate avoiding PCRE completely and, instead, writing even more complex, long-winded, probably convoluted code to replace them.
Big Regular Expression Fan will never understand these people.
Here is a very quick summary of PCRE's most relevant features. Items marked with + are supported by ARE (thank God).
 + foo(?=bar)                   match "foo" only if "bar" follows it
 + foo(?!bar)                   match "foo" only if "bar" does NOT follow it
   (?<=foo)bar                  match "bar" only if "foo" precedes it
   (?<!foo)bar                  match "bar" only if "foo" does NOT precede it
   (?<!in|on|at)foo             match "foo" only if NOT preceded by "in", "on" or "at"
   (?<=\d{3})(?<!999)foo        match "foo" only if preceded by 3 digits other than "999"
 + (?i)abc                      case-insensitive match of abc, ABC, aBc, ABc, etc.
 + ab(?i)c                      same as above; the (?i) applies throughout the pattern
   (ab(?i)c)                    matches abc or abC; the outer parens make the difference!
 + (?m)                         multi-line pattern space: same as "s/FIND/REPL/M"
 + (?s)                         set "." to match newline also: same as "s/FIND/REPL/S"
 + (?x)                         ignore whitespace and #comments
 + (?:abc)foo                   match "abcfoo", but do not capture 'abc' in \1
   (?:ab|cd)ef                  match "abef" or "cdef"; neither 'ab' nor 'cd' is captured in \1
 + (?#remark)xy                 match "xy"; remarks after "#" in the parens are ignored
   (?(condition)yes-pattern)
   (?(condition)yes-pattern|no-pattern)
                                ...matches conditionally, like "if" statements
   (?R)                         recursive match. OK, this one is really tough.
   \l                           make next letter small
   \L                           make letters small until \E
   \u                           make next letter capital
   \U                           make letters capital until \E
   \Q                           escape all until \E
   \E                           end of modifier's action
   \G                           end of previous match
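A few of the entries not marked with + can be tried in any engine that accepts the Perl lookbehind syntax. The sketch below uses Python's re module, which is not PCRE but accepts these particular constructs; the variable names and sample strings are illustrative only.

```python
import re

# (?<!in|on|at)foo : all three alternatives are two characters long,
# which Python's re requires inside a lookbehind.
not_after_prep = re.compile(r'(?<!in|on|at)foo')

# (?<=\d{3})(?<!999)foo : two lookbehinds tested at the same position.
three_digits_not_999 = re.compile(r'(?<=\d{3})(?<!999)foo')

# (?:ab|cd)(ef) : the first pair of parens only groups, the second captures.
grouping = re.compile(r'(?:ab|cd)(ef)')

for text in ('xyfoo', 'infoo', '123foo', '999foo', 'cdef'):
    print(text,
          bool(not_after_prep.search(text)),
          bool(three_digits_not_999.search(text)),
          grouping.match(text).groups() if grouping.match(text) else None)
```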
Roy Terry, 22July2003: It seems that Tcl's (?=foo) and (?!foo) are equivalent to the PCRE (?<...) feature. No?
Big Regular Expression Fan, 22July2003: Not exactly.
foo(?=bar) will match "foo" only if "bar" follows it. For example:
http://(?=www\.)
The RE above will only find Web addresses whose sub domain is 'www'.
foo(?!bar) will match "foo" only if "bar" does not follow it. So, conversely,
http://(?!www\.)
...will only find Web addresses whose sub domain is NOT 'www'.
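The two lookahead patterns above can be checked with a short sketch. It uses Python's re module, which accepts the same lookahead syntax as ARE and PCRE; the example.org addresses are made up for illustration.

```python
import re

has_www = re.compile(r'http://(?=www\.)')   # "http://" followed by "www."
no_www = re.compile(r'http://(?!www\.)')    # "http://" NOT followed by "www."

for url in ('http://www.example.org/', 'http://ftp.example.org/'):
    m_yes = has_www.search(url)
    m_no = no_www.search(url)
    # A lookahead consumes nothing: when there is a match,
    # the matched text is just "http://".
    print(url,
          m_yes.group(0) if m_yes else None,
          m_no.group(0) if m_no else None)
```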
(?<=foo)bar will match "bar" only if "foo" precedes it. For example:
(?<=http://)ftp\.
The RE above will only find Web addresses whose sub domain is 'ftp', but their protocol actually is 'http'.
(?<!foo)bar will match "bar" only if "foo" does not precede it. Therefore, conversely,
(?<!http://)ftp\.
...will only find Web addresses whose sub domain is 'ftp', expressly ruling out any one whose protocol is 'http'.
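The lookbehind patterns cannot be run in Tcl's ARE, so this sketch uses Python's re module instead (not PCRE, but it accepts fixed-width lookbehinds with the same syntax). The addresses are made up for illustration.

```python
import re

after_http = re.compile(r'(?<=http://)ftp\.')      # "ftp." preceded by "http://"
not_after_http = re.compile(r'(?<!http://)ftp\.')  # "ftp." NOT preceded by "http://"

for url in ('http://ftp.example.org/', 'ftp://ftp.example.org/'):
    m_yes = after_http.search(url)
    m_no = not_after_http.search(url)
    # A lookbehind consumes nothing either: the matched text is only "ftp.".
    print(url,
          m_yes.span() if m_yes else None,
          m_no.span() if m_no else None)
```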
Of course, in many cases they can be interchanged. You can say http://(?=ftp\.) instead of (?<=http://)ftp\. but that may force you to look for what you really want "backwards", so to speak. I've actually encountered situations in which I could not do without "look behind assertions", but I cannot recall any of them right now. Stay tuned to this page. Breaking news at any moment.
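One way to see the "backwards" point: both forms test the same condition, but they differ in which text ends up being the match, and that matters as soon as the match is replaced. A small sketch in Python's re (an assumption of convenience, since ARE has no lookbehind), with a made-up address:

```python
import re

url = 'http://ftp.example.org/'

ahead = re.compile(r'http://(?=ftp\.)')    # the match is "http://"
behind = re.compile(r'(?<=http://)ftp\.')  # the match is "ftp."

# Same condition, different matched span.
ahead_span = ahead.search(url).span()
behind_span = behind.search(url).span()

# Substitution replaces only the matched span, so the choice of
# assertion decides which part of the address gets rewritten.
new_protocol = ahead.sub('https://', url)
new_subdomain = behind.sub('www.', url)
```

Here the lookahead form is the natural one for changing the protocol, and the lookbehind form for changing the sub domain.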
Meanwhile, check out this beautifully formatted page: http://www.slabihoud.de/spampal/pcrepattern.html
DKF notes that look-behind assertions could probably be added to Tcl's RE package without stomping over the theoretical basis for it in any way worse than look-ahead assertions already do. But it would take some work from someone energetic...
Lars H, 2007-12-31: One thing I find a bit curious about the ARE lookaheads is that they are not limited to the subregexp in which they appear —
((?=foo)[a-z]+)o
matches "foo" with the capturing subexpression matching only "fo". This makes (?=…) different from & of the grammar_fa regexps, and similarly (?!…) different from what can be done with the grammar_fa !. I suspect the PCRE lookaheads are like the ARE ones, but you might want to check.
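As a partial check of that suspicion, here is the same pattern in Python's re module. This is not PCRE itself, only another backtracking engine with Perl-style lookaheads, so it is an indication rather than proof.

```python
import re

m = re.match(r'((?=foo)[a-z]+)o', 'foo')

# The lookahead sits inside the capturing group but is allowed to look
# past the point where that group ends.
whole = m.group(0)
captured = m.group(1)
print(whole, captured)
```

With this engine the whole match is "foo" and the group holds only "fo", the same behaviour as described for ARE.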
Info copied from Regular Expressions page:
Most common regular expression implementations (notably Perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See [L1 ] for some fascinating benchmarks.
Lars H: One point here, if I read the PCRE manpages correctly, is that the "alternative" matching algorithm of PCRE (FA-based, hence linear time) cannot do capturing parentheses. The paper quoted above does however contain a remark that it is perfectly possible for FA-based RE engines to do capturing parentheses.
Note on theoretical background: The classical approach to regular expressions, by way of finite automata (FA), is about checking whether a string (or "word", as it tends to be called in the theoretical literature) in its entirety matches a regular expression (like including ^ and $ in every regexp). It is fairly easy to modify the setting so that one gets a "searching" regexp instead (add .* at the beginning, stop when reaching an accepting state), but finding what subexpressions were matched is nontrivial.
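The difference between whole-string matching and searching can be sketched with Python's re module. Note the assumption: re is a backtracking engine, not an FA, so this only illustrates the two settings and the ".*" trick, not the automaton construction. The helper name is made up.

```python
import re

def search_via_fullmatch(pattern, text):
    # "Searching" expressed as whole-string matching: allow arbitrary
    # text before and after the pattern. DOTALL lets "." cross newlines.
    wrapped = '.*(?:' + pattern + ').*'
    return re.fullmatch(wrapped, text, re.DOTALL) is not None

text = 'xabbby'
whole_string = re.fullmatch(r'ab+', text)    # classical setting: entire word
searching = re.search(r'ab+', text)          # searching setting
wrapped_ok = search_via_fullmatch(r'ab+', text)
```

The wrapped form answers only yes or no; recovering where the match is, or what a subexpression matched, is the part that takes extra work.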
More features from PCRE...
As a side note on recursive matching, that alters the language from being expressible with a Finite Automaton to requiring a full Turing Machine, and hence not an RE language any more but a full programming language with a horrible syntax. If you want that sort of power, use Tcl for real. :^D
Lars H: You're sure it doesn't just change the language into context-free or context-sensitive [L2 ]? Not that a linear-bounded non-deterministic Turing machine should be much better than the full thing, though.
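To make the context-free point concrete: the textbook use of recursive matching is balanced parentheses, a language no finite automaton accepts but which needs only a depth counter, far short of a Turing machine. Python's re has no (?R), so this sketch uses a plain loop rather than a pattern; the function name is illustrative.

```python
def balanced(text):
    # Track nesting depth; characters other than parens are ignored.
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                # A closing paren with no matching opener.
                return False
    return depth == 0
```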