Regular expressions provide a very powerful method of defining a pattern, but they are a bit awkward to understand and to use properly. So let us examine some more examples in detail.
We start with a simple yet non-trivial example: finding floating-point numbers in a line of text. Do not worry: we will keep the problem simpler than it is in its full generality. We only consider numbers like 1.0 and not 1.00e+01.
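How do we design our regular expression for this problem? By examining typical examples of the strings we want to match, and of the strings we do not.
Valid numbers are:
1.0, .02, +0., 1, +1, -0.0120
Invalid numbers (strings that superficially look like numbers but that we do not want to recognise):
-, +., 0.0.1, 0..2, ++1
Questionable numbers are:
+0000 and 0001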
We will accept them - because they normally are accepted and because excluding them makes our pattern more complicated.
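A pattern is beginning to emerge: a number may start with an optional sign, matched by [-+]?; it may have digits before the period and digits after it, matched by [0-9]* on either side; and the period itself, \., may be absent, hence \.?.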
The total expression is:
[-+]?[0-9]*\.?[0-9]*
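At this point we can do three things. First, we can try the expression on a bunch of examples like the ones above and see whether the proper ones match and the others do not. Second, we can make it look nicer before testing it. The character class [0-9] is so common that it has a shorthand, \d, so we could settle for:
[-+]?\d*\.?\d*
instead. Or we could decide that we want to capture the digits before and after the period for special processing:
[-+]?([0-9]*)\.?([0-9]*)
Third, and this may be a good strategy in general, we can carefully examine the pattern before we start using it.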
You see, there is a problem with the above pattern: all the parts are optional, that is, each part can match a null string - no sign, no digits before the period, no period, no digits after the period. In other words: Our pattern can match an empty string!
Our questionable numbers, like "+0000", will be perfectly acceptable and we (grudgingly) agree. But more surprisingly, the strings "--1" and "A1B2" will be accepted too! Why? Because the pattern is not required to cover the whole string: it can match anywhere in it. So it finds a lone "-" in "--1" and, since every part is optional, the empty string at the very start of "A1B2"!
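A quick way to see the problem is to run the permissive pattern over a few strings and print what it matched. This is only a small sketch; the sample strings are chosen here for illustration.

```tcl
# The permissive pattern from the text: every part is optional.
set pattern {[-+]?\d*\.?\d*}

foreach s {"1.0" "--1" "A1B2" ""} {
    # regexp returns 1 on a match and stores the matched text in "match"
    set found [regexp $pattern $s match]
    puts ">>$s<<: found=$found, matched >>$match<<"
}
```

A match is reported for every string, even the empty one, which shows that the pattern cannot be used to tell numbers from other text.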
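We need to reconsider our pattern - it is too simple, too permissive. The number must contain at least one digit, and we have to look at the characters around it. The character before the number, if there is one, may not be a digit, a period or a sign; let us require a space, a tab or the beginning of the string: ^|[ \t]. The character after the number, if there is one, may not be a "+", "-" or ".", as that would bring back the unacceptable number-like strings: $|[^+-.] (the dollar sign signifies the end of the string; inside the brackets, +-. is read as the range from "+" to ".", which also contains the comma).
Before trying to write down the complete regular expression, let us see what different forms we have:
No period: [-+]?[0-9]+
A period without digits before it: [-+]?\.[0-9]+
Digits before a period, and possibly digits after it: [-+]?[0-9]+\.[0-9]*
Now the synthesis: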
(^|[ \t])([-+]?([0-9]+|\.[0-9]+|[0-9]+\.[0-9]*))($|[^+-.])
Or:
(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])
The parentheses are needed to distinguish the alternatives introduced by the vertical bar and to capture the substring we want to have. Each set of parentheses also defines a substring and this can be put into a separate variable:
regexp {.....} $line whole char_before number nosign char_after
#
# Or simply only the recognised number (x's as placeholders, the
# last can be left out)
#
regexp {.....} $line x x number
Tip: To identify these substrings: just count the opening parentheses from left to right.
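To make the counting of parentheses concrete, the sketch below applies the full pattern to a made-up line of text and prints each captured substring. The sample line is an assumption for illustration; the variable names follow the ones used above.

```tcl
set pattern {(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])}
set line    "x = -0.5 m"

# One variable for the whole match, then one per opening parenthesis,
# counted from left to right.
if { [regexp $pattern $line whole char_before number nosign char_after] } {
    puts "whole       = >>$whole<<"
    puts "char_before = >>$char_before<<"
    puts "number      = >>$number<<"
    puts "nosign      = >>$nosign<<"
    puts "char_after  = >>$char_after<<"
}
```

The second opening parenthesis starts the signed number and the third starts the alternatives without the sign, which is why "number" and "nosign" differ only in the leading "-".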
If we put it to the test:
set pattern  {(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^+-.])}
set examples {"1.0" " .02" " +0." "1" "+1" " -0.0120"
              "+0000" " - " "+." "0001" "0..2" "++1"
              "A1.0B" "A1"}
foreach e $examples {
    # The fourth variable receives the number without its sign
    if { [regexp $pattern $e whole char_before number nosign] } {
        puts ">>$e<<: $number ($whole)"
    } else {
        puts ">>$e<<: Does not contain a valid number"
    }
}
the result is:
>>1.0<<: 1.0 (1.0)
>> .02<<: .02 ( .02)
>> +0.<<: +0. ( +0.)
>>1<<: 1 (1)
>>+1<<: +1 (+1)
>> -0.0120<<: -0.0120 ( -0.0120)
>>+0000<<: +0000 (+0000)
>> - <<: Does not contain a valid number
>>+.<<: Does not contain a valid number
>>0001<<: 0001 (0001)
>>0..2<<: Does not contain a valid number
>>++1<<: Does not contain a valid number
>>A1.0B<<: Does not contain a valid number
>>A1<<: Does not contain a valid number
So our pattern correctly accepts the strings we intended to be recognised as numbers and rejects the others.
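If the candidate string has already been isolated (a single word rather than a whole line of text), a simpler variation is to anchor the pattern at both ends, so that no surrounding characters need to be examined. This is a sketch of that alternative, not the approach used above, and the procedure name isSimpleNumber is made up for the example.

```tcl
# Returns 1 if the complete string is a number of the simple kind
# discussed above (optional sign, digits, optional period), else 0.
proc isSimpleNumber {s} {
    # ^ and $ force the pattern to cover the string from start to end
    return [regexp {^[-+]?(\d+\.?\d*|\.\d+)$} $s]
}

foreach s {1.0 .02 +0. 1 -0.0120 +. 0..2 ++1 A1} {
    puts [format "%-8s %d" $s [isSimpleNumber $s]]
}
```

Because both alternatives inside the parentheses require at least one digit, the empty string and a lone sign are rejected.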
Let us turn to some other patterns now:
First, text enclosed in quotes, as in: This is "quoted text". Suppose we do not know the enclosing character in advance (it can be " or '). Then:
regexp {(["'])[^"']*\1} $string enclosed_string
will do it; the \1 is a so-called back-reference to the first captured substring.
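As a small illustration of the back-reference, the sketch below adds a second pair of parentheses to the pattern so that the text between the quotes is captured separately. The sample string and the variable names are assumptions made for this example.

```tcl
set string {She wrote 'hello world' twice}

# (["']) captures the opening quote, ([^"']*) the text inside it,
# and \1 insists on the same quote character at the end.
if { [regexp {(["'])([^"']*)\1} $string whole quote contents] } {
    puts "Whole match:     $whole"
    puts "Quote character: $quote"
    puts "Contents:        $contents"
}
```

Without the back-reference, a pattern such as ["'][^"']*["'] would also accept a string that opens with one kind of quote and closes with the other.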
You can use the same technique to see if a word occurs twice in the same line of text:
set string "Again and again and again ..."
if { [regexp {(\y\w+\y).+\1} $string => word] } {
    puts "The word $word occurs at least twice"
}
(The pattern \y matches the beginning or the end of a word and \w+ indicates we want at least one word character, that is, a letter, a digit or an underscore).
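A simple check on the parentheses in a mathematical expression is to count the open and close parentheses:
#
# Use the return value of [regexp] to count the number of
# parentheses ... (a literal parenthesis must be escaped)
#
if { [regexp -all {\(} $string] != [regexp -all {\)} $string] } {
    puts "Parentheses unbalanced!"
}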
Of course, this is just a rough check. A better one is to see if at any point while scanning the string there are more close parentheses than open parentheses. We can easily extract the parentheses and put them in a list (the -inline option does that):
set parens  [regexp -inline -all {[()]} $string]
set balance 0
array set change {( 1 ) -1}   ;# This technique saves an if-block :)
foreach p $parens {
    incr balance $change($p)
    if { $balance < 0 } {
        break                 ;# A ")" without a preceding "("
    }
}
if { $balance != 0 } {
    puts "Parentheses unbalanced!"
}
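The scanning check can be wrapped in a procedure so that it is easy to try on several strings. This is a sketch only: the procedure name parensBalanced and the sample expressions are invented for the example.

```tcl
# Returns 1 if the parentheses in the string are balanced, 0 otherwise.
proc parensBalanced {string} {
    array set change {( 1 ) -1}
    set balance 0
    foreach p [regexp -inline -all {[()]} $string] {
        incr balance $change($p)
        if { $balance < 0 } {
            # More close than open parentheses so far
            return 0
        }
    }
    return [expr {$balance == 0}]
}

foreach expression {"(1+a)*(1-b*x)" "(1+a))*((1-b*x)" "((1+a)"} {
    puts [format "%-18s %d" $expression [parensBalanced $expression]]
}
```

The second expression has equal numbers of open and close parentheses, so the simple counting check would accept it; the scan rejects it at the first surplus ")".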
Suppose you want to extract a piece of text from an HTML file, anything between two paragraph tags, <p>. One way to do that is:
regexp {<p>(.*)<p>} $html => text
We use two "<p>" tags to enclose the text we want to find, because quite often the actual closing tag, </p>, is missing - HTML is often very sloppy.
As we are interested only in the text between the tags, we use the somewhat odd variable name "=>". In Tcl that is a perfectly legal name and because it looks like an arrow it draws attention to the variable "text", the one we are really going to use.
Unfortunately, this will give us a lot more than we want: regular expressions like this match the longest substring that fits. So we get all the stuff between the first "<p>" tag and the last one. Instead we need to use a non-greedy quantifier - *? - to match the shortest substring instead:
regexp {<p>(.*?)<p>} $html => text
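The difference between the two quantifiers is easy to see on a short piece of text. The HTML fragment below is made up for the example.

```tcl
set html "<p>First paragraph<p>Second paragraph<p>Third paragraph"

# Greedy: .* takes as much as possible, up to the last <p>
regexp {<p>(.*)<p>}  $html => greedy
# Non-greedy: .*? takes as little as possible, up to the next <p>
regexp {<p>(.*?)<p>} $html => lazy

puts "Greedy:     $greedy"
puts "Non-greedy: $lazy"
```

The greedy version returns the first two paragraphs including the tag between them; the non-greedy version returns the first paragraph only.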
As a last example we are going to "commify" numbers: change numbers like 10000.00 into 10,000.00 for better readability. A concise regular expression to do so is used in the procedure commify:
proc commify {number} {
    regsub -all {\d(?=(\d{3})+($|\.))} $number {\0,}
}
puts [commify 1000000.00]
results in:
1,000,000.00
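This regular expression contains several advanced features. The basic pattern is \d, a single digit. It is followed by (?=...), a lookahead constraint: the text after the digit must match the expression inside, but that text does not become part of the match. Inside the lookahead, (\d{3})+ requires one or more groups of exactly three digits, and ($|\.) requires that these groups run up to the end of the string or to a period.
In the substitution part, \0 stands for the whole matched text, which is only the single digit because the lookahead consumes nothing. So, with -all, every digit that is followed by one or more complete triplets is replaced by itself and a comma.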
Note, however, that this only works if a dot is followed by not more than three digits:
% commify 1234567.123456
1,234,567.123,456
This is a rather common quirk of regular expressions: they may match more than you intended.
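One possible way around this quirk is to split the number at the period and apply the substitution to the integer part only. The sketch below is one such variation, not the procedure from the text; the name commifyIntegerPart is made up for the example.

```tcl
# Inserts commas in the digits before the period and leaves the
# fractional part (if any) untouched.
proc commifyIntegerPart {number} {
    set idx [string first . $number]
    if { $idx < 0 } {
        # No period: the whole string is the integer part
        set idx [string length $number]
    }
    set intPart  [string range $number 0 [expr {$idx - 1}]]
    set fracPart [string range $number $idx end]
    # Only the end of the integer part can close a group of triplets now
    return [regsub -all {\d(?=(\d{3})+$)} $intPart {\0,}]$fracPart
}

puts [commifyIntegerPart 1234567.123456]
puts [commifyIntegerPart 1000000.00]
```

Keeping the regular expression simple and doing the splitting with ordinary string commands is often easier to read than a single, more elaborate pattern.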
Finally: Regular expressions are very powerful, but they have certain theoretical limitations. One of these limitations is that they are not suitable for parsing arbitrarily nested text.
You can experiment with regular expressions using the VisualRegexp or Visual REGEXP applications.
More on the theoretical background and practical use of regular expressions (there is lots to cover!) can be found in the book Mastering Regular Expressions by J. Friedl.