Version 5 of Advanced Regular Expression Examples

Updated 2008-11-26 19:35:07 by nem

Solve a nasty regular expression problem? Share your examples for the benefit of others here. Please include the problem statement, the code and then any discussion.

See Regular Expression Examples for the easy examples.


Problem Statement: Modify a string with a series of substrings enclosed in double square brackets ([[ ]]) such that each is enclosed in single brackets instead. Example input:

  [[X]] [[abc]] [[123]] 

Desired output:

  [X] [abc] [123] 

The actual substrings could be any arbitrarily nasty text, including such things as &, \1 etc.

Code:

    regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} thebody

Discussion: The non-greedy expression .*? matches as little text as possible, so each match ends at the first ]] that follows the opening [[. Works great. This example comes from Wiki Conversion OddMuse To Confluence where you can find the entire example.

Lars H: Doesn't look particularly advanced to me; the only nontrivial aspect is the use of a non-greedy quantifier. Occurrences of regexp syntax characters in the string operated on have never been a problem (but including them in the regexp pattern can be). Also, one might wonder whether the following string map isn't an easier way to do this:

  string map {[[ [ ]] ]} $thebody

Either way, the edge cases occur when $thebody contains brackets that aren't string delimiters. Possibly strings containing newlines can also be troublesome.
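
A minimal sketch comparing the two approaches on the sample input. The variable names (viaRegsub, viaMap, stray, strayRegsub, strayMap) and the test strings are made up for illustration. The last two commands show one edge case where the approaches differ: a stray ]] with no opening [[.

```tcl
# Sample input, including "nasty" text such as & and \1 inside the brackets.
set thebody {[[X]] [[abc]] [[a&b \1]]}

# Approach 1: non-greedy regsub, as in the Code section above.
regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} viaRegsub

# Approach 2: plain string map, as suggested by Lars H.
set viaMap [string map {[[ [ ]] ]} $thebody]

puts $viaRegsub
puts $viaMap

# Edge case: a closing pair that is not a delimiter.
set stray {x]]y}
regsub -all {\[\[(.*?)\]\]} $stray {[\1]} strayRegsub
set strayMap [string map {[[ [ ]] ]} $stray]
puts $strayRegsub
puts $strayMap
```

On the sample input both approaches give [X] [abc] [a&b \1]. On the stray ]] the regsub leaves the text alone, because it needs a matching [[, while string map shortens it to x]y.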


Problem Statement: Modify a string containing a series of substrings enclosed in double square brackets ([[ ]]) such that any spaces in those substrings are converted to underscores (_).

Code:

    while { [regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0 } {}

OR

    set idx [regexp -inline -all -indices {\[[^]]*\]} $txt]
    foreach pair $idx {
      foreach {start end} $pair { break }
      set new [string map [list { } _] [string range $txt $start $end]]
      set txt [string replace $txt $start $end $new]
    }

Discussion: The first statement takes a purely iterative approach: each match replaces exactly one space, and the loop repeats until a pass makes no replacement. Where there are many more substrings than there are spaces in those substrings, this method is clearly computationally efficient. If there are more spaces than substrings then, at some point, it becomes more computationally efficient to follow the second set of statements. Those statements were constructed by another person on the Tcler's Chat. There does not seem to be a way to solve this problem in a single step. The first statement comes from Wiki Conversion OddMuse To Confluence where you can find the entire example.
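
To try the two variants side by side, they can be wrapped in procs. This is only a sketch: the proc names underscoreByLoop and underscoreByIndices and the sample string are made up here, and lassign (Tcl 8.5 or later) stands in for the foreach/break idiom.

```tcl
# Hypothetical wrapper around the first variant (the regsub loop).
proc underscoreByLoop {thebody} {
    # Each match turns one space into an underscore; stop when nothing matched.
    while { [regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0 } {}
    return $thebody
}

# Hypothetical wrapper around the second variant (index-based rewrite).
proc underscoreByIndices {txt} {
    foreach pair [regexp -inline -all -indices {\[[^]]*\]} $txt] {
        lassign $pair start end
        # The replacement has the same length, so later indices stay valid.
        set new [string map [list { } _] [string range $txt $start $end]]
        set txt [string replace $txt $start $end $new]
    }
    return $txt
}

set sample {see [[a b c]] and [[d e]] here}
puts [underscoreByLoop $sample]
puts [underscoreByIndices $sample]
```

Both calls should print see [[a_b_c]] and [[d_e]] here. Note that the pattern in the second variant matches any bracketed run, so it would also rewrite spaces inside single brackets such as [x y].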


Problem Statement: Find the longest substring starting with pattern 1, not containing pattern 2 and ending with pattern 3.

AM (The problem came up in the chatroom, November 2008) For example: if I have the string "This is the Tclers' Wiki", I might want to find the longest substring that starts with "T", ends with "i" and does not contain "the" (the answer would be: Tclers' Wiki). If the pattern to be excluded is a single character or anything you can put in [..], the problem is easy. But here the excluded pattern is a multi-character string, so a character class will not do: we would accept "Tthhei" as a valid substring, because it never contains "the".

Code:

#
# I can only imagine the following steps:
#
# 1. Split the string on the excluded pattern (use splitx from Tcllib for instance)
# 2. Examine all elements of the resulting list and extract the longest substring that matches "T.*i"
#
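
The two steps above might be sketched as follows. This is an illustration under stated assumptions, not a general solution: the proc name longestWithout is made up, the split is done with regsub and a marker character (\u0001, assumed not to occur in the input) rather than with splitx from Tcllib, and splitting is only safe when the start and end patterns cannot overlap an occurrence of the excluded pattern (true for "T", "i" and "the").

```tcl
# Hypothetical helper: longest match of matchRE that lies between
# occurrences of excludeRE in str. Returns an empty string if none.
proc longestWithout {str excludeRE matchRE} {
    # Step 1: split on the excluded pattern by replacing each occurrence
    # with a marker character (assumed absent from str), then splitting.
    regsub -all -- $excludeRE $str \u0001 marked
    set best {}
    foreach piece [split $marked \u0001] {
        # Step 2: a greedy match yields the longest candidate in this piece.
        if {[regexp -- $matchRE $piece m] && [string length $m] > [string length $best]} {
            set best $m
        }
    }
    return $best
}

puts [longestWithout "This is the Tclers' Wiki" {the} {T.*i}]
```

For the example string this prints Tclers' Wiki: the pieces are "This is " and " Tclers' Wiki", and the second yields the longer match.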

Discussion:

The example came up in a discussion of what regular expressions are not good at. I claimed they are not good at not matching a pattern - a rather vague statement, of course. But thinking over this particular example, I could not come up with any RE pattern that fulfills the requirements.

Perhaps it is possible, but then I'd like to see a demonstration.

NEM: This is discussed on Regular Expression Examples (negated strings). Using the approach DKF described:

% set pat {T(?:[^t]|t(?!he))*i}
T(?:[^t]|t(?!he))*i
% regexp -all -inline $pat $str
{This i} {Tclers' Wiki}

I don't know if you can get regexp to return only the longest match, though.
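
One way around this, sketched here rather than done inside the regexp itself, is to let regexp -all -inline collect every match and then pick the longest in plain Tcl. The variable name longest is made up for the example; str is set to the sample sentence used above.

```tcl
set str "This is the Tclers' Wiki"
set pat {T(?:[^t]|t(?!he))*i}

# Collect all non-overlapping matches, then keep the longest one.
set longest {}
foreach m [regexp -all -inline -- $pat $str] {
    if {[string length $m] > [string length $longest]} {
        set longest $m
    }
}
puts $longest
```

For the sample sentence this prints Tclers' Wiki. The pattern has no capturing groups, so the list returned by regexp holds only whole matches.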


*** Recommended Template

Problem Statement:

Code:

Discussion: