Advanced Regular Expression Examples

Solved a nasty regular expression problem? Share your examples here for the benefit of others. Please include the problem statement, the code, and then any discussion.

Modify Double Brackets

Problem Statement: Modify a string containing a series of substrings enclosed in double square brackets ([[ ]]) so that they are enclosed in single brackets instead. Example input:

[[X]] [[abc]] [[123]]

Desired output:

[X] [abc] [123]

The actual substrings could be any arbitrarily nasty text, including such things as &, \1 etc.

Code:

regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} thebody

Discussion: The non-greedy expression .*? matches as little text as possible, so each match stops at the first ]] that follows the opening [[. Works great. This example comes from Wiki Conversion OddMuse To Confluence where you can find the entire example.
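
A quick check of the claim that & and \1 in the input are harmless, as a minimal sketch; the sample text is made up for illustration. Only the substitution spec ({[\1]}) is scanned for & and backslash sequences, never the text that was matched.

```tcl
# Hypothetical input containing characters that are special in a regsub
# substitution spec, but not in the text being operated on.
set thebody {[[X]] [[a&b]] [[\1]]}

regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} thebody

puts $thebody   ;# [X] [a&b] [\1]
```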

Lars H: Doesn't look particularly advanced to me; the only nontrivial aspect is the use of a non-greedy quantifier. Occurrences of regexp syntax characters in the string operated on have never been a problem (but including them in the regexp pattern can be). Also, one might wonder whether the following string map isn't an easier way to do this:

  string map {[[ [ ]] ]} $thebody

Either way, the edge cases occur when $thebody contains brackets that aren't string delimiters. Possibly strings containing newlines can also be troublesome.
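
To make the edge case concrete, here is a small sketch comparing the two approaches on a made-up input with a stray ]] that does not close anything. Neither result is necessarily "right"; the point is only that the two commands disagree once the brackets are not neatly paired.

```tcl
# Hypothetical input: one well-formed item followed by a stray "]]".
set thebody {[[a]] x]] y}

# string map rewrites every "[[" and "]]", paired or not.
set mapped [string map {[[ [ ]] ]} $thebody]

# regsub only rewrites complete "[[" ... "]]" pairs and leaves the rest alone.
regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} subbed

puts $mapped   ;# [a] x] y
puts $subbed   ;# [a] x]] y
```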

Problem Statement: Modify a string containing a series of substrings enclosed in double square brackets ([[ ]]) such that any spaces in those substrings are converted to underscores (_).

Code:

while { [regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0 } {}



set idx [regexp -inline -all -indices {\[[^]]*\]} $txt]
foreach pair $idx {
  foreach {start end} $pair { break }
  set new [string map [list { } _] [string range $txt $start $end]]
  set txt [string replace $txt $start $end $new]
}

Discussion: The first statement takes a purely iterative approach: each regsub pass replaces at most one space per bracketed substring, and the loop repeats until a pass replaces nothing. Where there are many more substrings than there are spaces in those substrings, this method is computationally efficient. This example comes from Wiki Conversion OddMuse To Confluence where you can find the entire example. If there are more spaces than substrings then, at some point, it becomes more efficient to use the second set of statements, which finds the index range of each bracketed substring once and applies string map to it (note that it operates on a variable named txt rather than thebody). Those statements were constructed by another person on the Tcler's Chat. There does not seem to be a way to solve this problem in a single step.
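
For side-by-side experimentation, the two approaches above can be wrapped in procs. This is only a sketch: the proc names underscoreLoop and underscoreIndices and the sample string are invented here, while the regular expressions are the ones given above.

```tcl
# Approach 1: repeat regsub until a pass finds no bracketed space.
proc underscoreLoop {thebody} {
    while {[regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0} {}
    return $thebody
}

# Approach 2: locate each bracketed span once, then string map inside it.
# The replacement has the same length as the original span, so the
# indices collected up front stay valid while txt is being modified.
proc underscoreIndices {txt} {
    foreach pair [regexp -inline -all -indices {\[[^]]*\]} $txt] {
        foreach {start end} $pair break
        set new [string map [list { } _] [string range $txt $start $end]]
        set txt [string replace $txt $start $end $new]
    }
    return $txt
}

set sample {[[a b c]] and [[d e]]}
puts [underscoreLoop $sample]
puts [underscoreIndices $sample]
```

Both calls should print [[a_b_c]] and [[d_e]], leaving the spaces outside the brackets untouched.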

Problem Statement: Find the longest substring starting with pattern 1, not containing pattern 2 and ending with pattern 3.

AM (The problem came up in the chatroom, November 2008): For example, if I have the string "This is the Tclers' Wiki", I might want to find the longest substring that starts with "T", ends with "i" and does not contain "the" (the answer would be: Tclers' Wiki). If the pattern to be excluded is a single character or anything you can put in [..], the problem is easy. But here a character class such as [^the] would wrongly reject "Tthhei", which does not contain "the" and should be accepted as a valid substring.

Code:

#
# I can only imagine the following steps:
#
# 1. Split the string on the excluded pattern (use splitx from Tcllib for instance)
# 2. Examine all elements of the resulting list and extract the longest substring that matches "T.*i"
#
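
The two steps above could be sketched roughly as follows. This is not AM's code: the proc name longestWithout is invented, string map plus split stands in for Tcllib's splitx so the sketch has no dependencies, and the excluded pattern is assumed to be a literal string. One known limitation: a candidate that ends partway into an occurrence of the excluded text is cut short, because the whole occurrence is removed by the split.

```tcl
# Hypothetical helper: longest piece of $str that matches $re and
# does not contain the literal string $excluded.
proc longestWithout {str excluded re} {
    # Step 1: split on the excluded literal. \u0001 is a stand-in
    # separator and is assumed not to occur in $str.
    set pieces [split [string map [list $excluded \u0001] $str] \u0001]
    # Step 2: keep the longest match found in any piece.
    set best ""
    foreach piece $pieces {
        if {[regexp -- $re $piece m] && [string length $m] > [string length $best]} {
            set best $m
        }
    }
    return $best
}

puts [longestWithout "This is the Tclers' Wiki" the {T.*i}]   ;# Tclers' Wiki
```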

Discussion:

The example came up in a discussion of what regular expressions are not good at. I claimed they are not good at not matching a pattern - a rather vague statement, of course. But thinking over this particular example, I could not come up with any RE pattern that fulfills the requirements.

Perhaps it is possible, but then I'd like to see a demonstration.

NEM: This is discussed on Regular Expression Examples (negated strings). Using the approach DKF described:

% set pat {T(?:[^t]|t(?!he))*i}
T(?:[^t]|t(?!he))*i
% regexp -all -inline $pat $str
{This i} {Tclers' Wiki}

I don't know if you can get regexp to return only the longest match, though.
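
Failing a pure regexp answer, one workaround is to post-process the list that regexp -all -inline returns. A minimal sketch, assuming the same pattern and sample string as above; since -all only reports non-overlapping matches, a longer candidate that overlaps an earlier match would be missed.

```tcl
set str "This is the Tclers' Wiki"
set pat {T(?:[^t]|t(?!he))*i}

# Walk the matches and remember the longest one seen so far.
set longest ""
foreach m [regexp -all -inline $pat $str] {
    if {[string length $m] > [string length $longest]} {
        set longest $m
    }
}
puts $longest   ;# Tclers' Wiki
```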

PYK 2016-01-09: Here is an example, very much along the same lines, of obtaining all strings that begin with pattern1, include pattern2, and end with pattern3. Because a negative lookahead constraint can't be followed by a quantifier, it is enclosed, together with the . it guards, in a non-capturing set of parentheses, which is then followed by the non-greedy quantifier.

regexp -all -inline -- {pattern1(?:(?!pattern3).)*?pattern2.*?pattern3} $somevalue

AMG: Marking a negative lookahead constraint as optional strikes me as meaningless. Just say pattern1.*pattern2.*pattern3 and be done.

This is backwards from what is requested by the problem statement, by the way. The discussion is intended to explore the difficulty of crafting regular expressions that reject a subexpression, when the occurrence of said subexpression is not narrowly restricted.

PYK 2016-01-11: That would be true if the optional quantifier applied directly to the negative lookahead constraint, but then it would be illegal as well. In this case, the optional thing is a . that is not the start of pattern3, and it makes a real difference. With the simpler pattern1.*pattern2.*pattern3, pattern3 might occur between pattern1 and pattern2, and avoiding that is the main problem in the problem statement. In that sense, I don't think it's backward from the original problem at all, and I don't know any other way to solve the problem using a regular expression. This example is also interesting because it illustrates how, in certain cases, one might work around the restriction that a constraint may not be followed by a quantifier.
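
The difference between the two patterns can be tried out with single-letter stand-ins. This is only an illustrative sketch: a, b and c are hypothetical substitutes for pattern1, pattern2 and pattern3, and the test strings are made up.

```tcl
# Hypothetical stand-ins: pattern1 = a, pattern2 = b, pattern3 = c.
set guarded {a(?:(?!c).)*?b.*?c}
set simple  {a.*b.*c}

# Here c occurs between a and b: the guarded pattern should refuse it.
puts [regexp -- $guarded "a-c-b-c"]   ;# 0
puts [regexp -- $simple  "a-c-b-c"]   ;# 1

# Here no c precedes b: both patterns match.
puts [regexp -- $guarded "a-b-c"]     ;# 1
puts [regexp -- $simple  "a-b-c"]     ;# 1
```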

Problem Statement: Extract the major/minor and auxiliary numbers from a string like "26.3.Q005"

avi: This problem, like many others, can be easily addressed using character classes.

Code:

set verstring "26.3.Q005"
regexp {([[:digit:]]+)\.([[:digit:]]+)\.(.*)} $verstring all major minor auxiliary
puts "$major $minor $auxiliary"

Discussion:

AMG: I replaced "quote-quoting" with {brace-quoting} to eliminate the need for using backslashes to quote brackets. This greatly increases readability, and the readability of regular expressions needs all the help it can get. :^) In the process I corrected a subtle bug: the Tcl interpreter was gobbling up all the backslashes and not passing any to the regexp command, and that meant regexp was getting "." atoms (match anything) instead of "\." atoms (match period). Brace quoting is almost always the right thing for regular expressions. By the way, embedded substitutions inside regular expressions are usually a mistake, since the substituted text is interpreted as a regular expression and not a fixed, literal string. Always using braces to quote regular expressions will help you to remember this. :^)
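
The backslash-gobbling effect described above is easy to reproduce. A minimal sketch with a made-up pattern and made-up test strings:

```tcl
set quoted "a\.b"    ;# Tcl turns \. into . before regexp ever sees it
set braced {a\.b}    ;# the backslash reaches regexp intact

puts [regexp -- $quoted "axb"]   ;# 1: "." matches any character
puts [regexp -- $braced "axb"]   ;# 0: "\." matches only a period
puts [regexp -- $braced "a.b"]   ;# 1
```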

Here's a shorter way to write the same regular expression. Also this adds a ^ anchor to reject strings with junk at the beginning instead of discarding said junk. It's not necessary to have a $ anchor at the end, since .* will greedily consume all characters to the end of the string. I replaced "all" with "_" because that's just my personal preference for a dummy variable name.

regexp {^(\d+)\.(\d+)\.(.*)} $verstring _ major minor auxiliary

And last, this code can be written using scan. For simple matching and extraction, scan often outperforms regexp in both readability and efficiency. I suggest only using regexp when scan won't do the job.

scan $verstring %d.%d.%s major minor auxiliary
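
When using scan this way, it may be worth checking its return value, which is the number of conversions performed. A small sketch, with the error message invented for illustration; note also that %s stops at whitespace, whereas (.*) in the regexp version does not.

```tcl
set verstring "26.3.Q005"

# scan returns how many of the three fields it managed to convert.
if {[scan $verstring %d.%d.%s major minor auxiliary] == 3} {
    puts "$major $minor $auxiliary"   ;# 26 3 Q005
} else {
    puts "unrecognised version string: $verstring"
}
```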

avi: Thanks! The use of regular expression in this context was to illustrate the power of character classes - and yes, I thoroughly agree with you that in this case scan would have done the job.

Problem Statement: Extract some data while throwing away other pieces that are similarly formatted.

dcd: Here, the trailing two items are of interest in a stream where initial metadata is separated by pipe characters, but the first three elements are not. An example of the input might be "BLAH|CODE|IDEA|BEEF|stuff........"; in this case, we don't care about BLAH, CODE, or IDEA.

Code:

set re {(?:(?:[^|]+)\|){3}([[:xdigit:]]{4})\|(.*)}
regexp $re $msg d meat potatoes

Discussion:

This uses non-capturing REs twice: once in the RE that defines a pipe-separated field, and once in the enclosing RE, with a repetition count of exactly 3, that throws away the first three fields. It took me a few tries to realize I could use ?: twice to get the result I wanted.
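
Applied to the sample input from the problem statement, the code above can be run as follows. The value assigned to msg is the example string given earlier; the output format is just for illustration.

```tcl
set msg "BLAH|CODE|IDEA|BEEF|stuff........"
set re {(?:(?:[^|]+)\|){3}([[:xdigit:]]{4})\|(.*)}

# d receives the whole match; only meat and potatoes are captured,
# because the groups covering the first three fields are non-capturing.
if {[regexp $re $msg d meat potatoes]} {
    puts "meat=$meat potatoes=$potatoes"   ;# meat=BEEF potatoes=stuff........
}
```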

Recommended Template

Problem Statement:

Code:

Discussion:

Category String Processing