Solved a nasty regular expression problem?  Share your examples here for the benefit of others.  Please include the problem statement, the code, and then any discussion.


** See Also **

   [Regular Expression Examples]:   The easy examples.

   [cmdSplit]:   `[wordparts]` and `[exprlex]` go heavy on the regular expressions, and in a familiar context:  Tcl Syntax.


** Modify Double Brackets **

'''Problem Statement: ''' Modify a string containing a series of substrings enclosed in double square brackets ([[[[ ]]]]) so that each substring is enclosed in single brackets instead.
Example input: 

======
[[X]] [[abc]] [[123]] 
======

Desired output: 

======
[X] [abc] [123] 
======

The actual substrings could be any arbitrarily nasty text, including such things as &, \1 etc.

'''Code: ''' 

======
regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} thebody
======

'''Discussion: '''  The non-greedy expression .*? matches as little text as possible, so each match ends at the first ]]]] that follows the opening brackets.  Works great.  This example comes from [Wiki Conversion OddMuse To Confluence], where you can find the entire example.

''[Lars H]: Doesn't look particularly advanced to me; the only nontrivial aspect is the use of a non-greedy quantifier. Occurrences of regexp syntax characters in the string operated on have never been a problem (but including them in the regexp pattern can be). Also, one might wonder whether the following [string map] isn't an easier way to do this:''
  string map {[[ [ ]] ]} $thebody
''Either way, the edge cases occur when $thebody contains brackets that aren't string delimiters. Possibly strings containing newlines can also be troublesome.''
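
A minimal sketch comparing the two approaches. The sample strings are made up for illustration; the second one contains an opening pair with no matching close, which is one of the edge cases mentioned above.

======
set thebody {[[X]] [[abc]] [[a & \1 b]]}
regsub -all {\[\[(.*?)\]\]} $thebody {[\1]} viaRegsub
set viaMap [string map {[[ [ ]] ]} $thebody]
puts $viaRegsub   ;# [X] [abc] [a & \1 b]
puts $viaMap      ;# [X] [abc] [a & \1 b]

# Edge case: an opening pair that is never closed
set stray {a[[1]] [[b}
regsub -all {\[\[(.*?)\]\]} $stray {[\1]} strayRegsub
set strayMap [string map {[[ [ ]] ]} $stray]
puts $strayRegsub   ;# a[1] [[b
puts $strayMap      ;# a[1] [b
======

On balanced input the two agree. They differ only when a pair is unmatched: the regular expression leaves it alone, while the map rewrites every doubled bracket it sees.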

----

'''Problem Statement: ''' Modify a string containing a series of substrings enclosed in double square brackets ([[[[ ]]]]) such that any spaces in that substring are converted to underscores (_).

'''Code: ''' 

======
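# Approach 1: repeat until no space remains inside a double-bracketed substring
while { [regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0 } {}

# Approach 2: locate each bracketed substring by index, then map its spaces
set idx [regexp -inline -all -indices {\[[^]]*\]} $txt]
foreach pair $idx {
  lassign $pair start end   ;# lassign needs Tcl 8.5 or later
  set new [string map [list { } _] [string range $txt $start $end]]
  set txt [string replace $txt $start $end $new]
}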
======

'''Discussion: '''  The first statement takes a purely iterative approach: the loop re-runs [regsub] until a pass makes no replacement.  Where there are many more substrings than there are spaces in those substrings, this method is computationally efficient.  This example comes from [Wiki Conversion OddMuse To Confluence], where you can find the entire example.  If there are more spaces than substrings then, at some point, it becomes more computationally efficient to use the second set of statements, which visit each bracketed substring once and convert all of its spaces with [string map].  Those statements were constructed by another person on the Tcler's Chat.  There does not seem to be a way to solve this problem in a single step.
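
As a quick check, here is a sketch that wraps each approach in a proc and runs both on the same input. The proc names and the sample string are hypothetical, chosen only for this illustration.

======
# Approach 1: loop until regsub reports no more replacements
proc underscoreLoop {thebody} {
    while {[regsub -all {(\[\[[^] ]+) (.+?\]\])} $thebody {\1_\2} thebody] > 0} {}
    return $thebody
}

# Approach 2: one string map per bracketed substring, located by index
proc underscoreIndices {txt} {
    foreach pair [regexp -inline -all -indices {\[[^]]*\]} $txt] {
        lassign $pair start end
        set new [string map {{ } _} [string range $txt $start $end]]
        # same length as the original range, so later indices stay valid
        set txt [string replace $txt $start $end $new]
    }
    return $txt
}

set sample {[[a b c]] and [[d e]]}
puts [underscoreLoop $sample]      ;# [[a_b_c]] and [[d_e]]
puts [underscoreIndices $sample]   ;# [[a_b_c]] and [[d_e]]
======

Note that the second pattern also matches text in single brackets, so the two approaches can disagree on input that mixes single and double brackets.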

----

'''Problem Statement: ''' Find the longest substring starting with pattern 1, ''not'' containing pattern 2 and ending with pattern 3.

[AM] (The problem came up in the chatroom, November 2008) For example: if I have the string "This is the Tclers' Wiki", I might want to find the longest substring that starts with "T", ends with "i" and does ''not'' contain "the" (the answer would be: Tclers' Wiki). If the pattern to be excluded is a single character or anything you can put in [[..]], the problem is easy. But here the excluded pattern is a multi-character string, so a substring such as "Tthhei", which contains the letters but not the word, must still be accepted.

'''Code: ''' 

======
#
# I can only imagine the following steps:
#
# 1. Split the string on the excluded pattern (use splitx from Tcllib for instance)
# 2. Examine all elements of the resulting list and extract the longest substring that matches "T.*i"
#
======
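
The two steps above can be sketched as follows. This is only an illustration: the proc name is hypothetical, it uses [string map] plus [split] instead of splitx so that it has no Tcllib dependency, and it assumes that the character \u0001 never occurs in the input and that the start and end patterns do not overlap the excluded text.

======
# Hypothetical helper: longest match of $pattern in $str that does not
# contain the literal string $excluded.
proc longestWithout {str excluded pattern} {
    # Step 1: split on the excluded text (\u0001 is assumed unused in $str)
    set pieces [split [string map [list $excluded \u0001] $str] \u0001]
    set best {}
    foreach piece $pieces {
        # Step 2: a greedy match runs from the first start to the last end
        if {[regexp -- $pattern $piece match]
                && [string length $match] > [string length $best]} {
            set best $match
        }
    }
    return $best
}

puts [longestWithout "This is the Tclers' Wiki" the {T.*i}]   ;# Tclers' Wiki
======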

'''Discussion: '''  

The example came up in a discussion of what regular expressions are ''not'' good at. I claimed they are not good at ''not'' matching a pattern, which is a rather vague statement, of course. But thinking over this particular example, I could not come up with any RE pattern that fulfills the requirements.

Perhaps it is possible, but then I'd like to see a demonstration.

[NEM]: This is discussed on [Regular Expression Examples] (negated strings). Using the approach [DKF] described:

======
% set pat {T(?:[^t]|t(?!he))*i}
T(?:[^t]|t(?!he))*i
% regexp -all -inline $pat $str
{This i} {Tclers' Wiki}
======

I don't know if you can get regexp to return ''only'' the longest match, though.
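
One way around that, sketched here under the assumption that a post-processing step is acceptable, is to let [regexp] return all matches and pick the longest in a loop. Keep in mind that `-all` reports non-overlapping matches only, so in general a longer candidate that starts inside an earlier match would not be seen.

======
set str "This is the Tclers' Wiki"
set pat {T(?:[^t]|t(?!he))*i}

set longest {}
foreach m [regexp -all -inline $pat $str] {
    if {[string length $m] > [string length $longest]} {
        set longest $m
    }
}
puts $longest   ;# Tclers' Wiki
======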

----

'''Problem Statement: ''' Extract the major, minor, and auxiliary numbers from a string like "26.3.Q005".

[avi] This problem, like many others, can be easily addressed using character classes.

'''Code: ''' 

======
set verstring "26.3.Q005"
regexp {([[:digit:]]+)\.([[:digit:]]+)\.(.*)} $verstring all major minor auxiliary
puts "$major $minor $auxiliary"
======

'''Discussion: '''

[AMG]: I replaced "quote-quoting" with {brace-quoting} to eliminate the need for using backslashes to quote brackets.  This greatly increases readability, and the readability of regular expressions needs all the help it can get. :^)  In the process I corrected a subtle bug: the [Tcl] interpreter was gobbling up all the backslashes and not passing any to the [regexp] command, and that meant regexp was getting "." atoms (match anything) instead of "\." atoms (match period).  Brace quoting is almost always the right thing for regular expressions.  By the way, embedded substitutions inside regular expressions are usually a mistake, since the substituted text is interpreted as a regular expression and not a fixed, literal string.  Always using braces to quote regular expressions will help you to remember this. :^)

Here's a shorter way to write the same regular expression.  Also this adds a ^ anchor to reject strings with junk at the beginning instead of discarding said junk.  It's not necessary to have a $ anchor at the end, since .* will greedily consume all characters to the end of the string.  I replaced "all" with "_" because that's just my personal preference for a dummy variable name.

======
regexp {^(\d+)\.(\d+)\.(.*)} $verstring _ major minor auxiliary
======
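
A small sketch of the two points made above, quoting and anchoring. The variable names and the test strings "26x3" and "v26.3.Q005" are made up for the illustration.

======
set verstring "26.3.Q005"

# Quoting: inside double quotes Tcl consumes the backslash before the period,
# so regexp receives ^(\d+).(\d+) and the period matches any character.
set loose "^(\\d+)\.(\\d+)"
set strict {^(\d+)\.(\d+)}
puts [regexp $loose 26x3]    ;# 1, the x was accepted as a separator
puts [regexp $strict 26x3]   ;# 0

# Anchoring: without ^ the leading junk is skipped, with ^ the match fails
puts [regexp {(\d+)\.(\d+)\.(.*)} "v$verstring" _ major]   ;# 1
puts $major                                                ;# 26
puts [regexp {^(\d+)\.(\d+)\.(.*)} "v$verstring"]          ;# 0
======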

And last, this code can be written using [scan].  For simple matching and extraction, scan often outperforms regexp in both readability and efficiency.  I suggest only using regexp when scan won't do the job.

======
scan $verstring %d.%d.%s major minor auxiliary
======
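
The two are not exact equivalents, though. As a sketch, using a made-up input with trailing text: `%s` stops at the first whitespace, while `(.*)` runs to the end of the string. [scan] also returns the number of conversions it performed, which can be used to check the result.

======
set verstring "26.3.Q005 beta"

set n [scan $verstring %d.%d.%s major minor auxiliary]
puts "conversions=$n major=$major minor=$minor auxiliary=$auxiliary"
# conversions=3 major=26 minor=3 auxiliary=Q005

regexp {^(\d+)\.(\d+)\.(.*)} $verstring _ rmajor rminor rauxiliary
puts "major=$rmajor minor=$rminor auxiliary=$rauxiliary"
# major=26 minor=3 auxiliary=Q005 beta
======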

[avi]: Thanks! The use of regular expression in this context was to illustrate the power of character classes - and yes, I thoroughly agree with you that in this case scan would have done the job.

----

'''Problem Statement: ''' Extract some data while throwing away other pieces that are similarly formatted.

[dcd] Here, the input is a stream whose initial meta-data fields are separated by pipe characters. The last two items are of interest, but the first three fields are not. An example of the input might be "BLAH|CODE|IDEA|BEEF|stuff........", and, in this case, we don't care about BLAH, CODE, or IDEA.

'''Code: ''' 

======
set re {(?:(?:[^|]+)\|){3}([[:xdigit:]]{4})\|(.*)}
regexp $re $msg d meat potatoes
======

'''Discussion: '''  

This uses non-capturing groups twice: once around the RE that defines a pipe-separated field, and once around the enclosing RE, whose repeat count of exactly 3 throws away the first three fields. It took me a few tries to realize I could use ?: twice to get the result I wanted.
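
A short run on the example input, as a sketch. The second pattern is an assumption added for comparison: since the inner group only wraps a single quantified atom, dropping it should give the same captures.

======
set msg {BLAH|CODE|IDEA|BEEF|stuff........}

set re {(?:(?:[^|]+)\|){3}([[:xdigit:]]{4})\|(.*)}
if {[regexp $re $msg d meat potatoes]} {
    puts "meat=$meat potatoes=$potatoes"   ;# meat=BEEF potatoes=stuff........
}

# Same idea with only the outer non-capturing group
set simpler {(?:[^|]+\|){3}([[:xdigit:]]{4})\|(.*)}
regexp $simpler $msg d2 meat2 potatoes2
puts "meat=$meat2 potatoes=$potatoes2"     ;# meat=BEEF potatoes=stuff........
======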


*** Recommended Template ***

'''Problem Statement: ''' 

'''Code: ''' 
======

======
'''Discussion: '''  


<<categories>> String Processing