Purpose: tips and techniques for debugging problems with [regular expressions]

----
**About [regexp] -about**

The [regexp] command has an -about flag that appears to provide information about the command's behavior. Hopefully more information about this option will be added here as time goes by.

 $ set r {a{,3}}
 a{,3}
 $ set b {aabc}
 aabc
 $ regexp $r $b
 0
 $ regexp $r {abc}
 0
 $ regexp -- $r $b
 0
 $ regexp -about $r $b
 0 {REG_UBRACES REG_UUNSPEC}
 $ set r {a{0,3}}
 a{0,3}
 $ regexp -about $r $b
 0 {REG_UBOUNDS REG_UEMPTYMATCH}
 $ regexp -- $r $b
 1

----
The values returned from regexp -about are:

   * A digit which indicates the number of submatches available,
   * A list of symbols which indicate something about the regular expression.

Scrounging about `tclRegexp.c`, I see there are a number of possible symbols that '''-about''' can return.

   REG_UBACKREF:   Contains back-references.
   REG_ULOOKAHEAD:   Contains lookahead constraints (e.g., “`(?=`...`)`”)
   REG_UBOUNDS:   appears to indicate that the `{''m'',''n''}` quantifier is used
   REG_UBRACES:   appears to indicate that braces are used in a non-metacharacter manner
   REG_UBSALNUM:   Backslash followed by (unrecognized) alphanumeric. REG_UUNSPEC also set.
   REG_UPBOTCH:   RE engine detected that it is in a case that the POSIX spec botched (unmatched “`)`” character)
   REG_UBBS:   Backslash in bracketed term (i.e., “`[[`...`\`...`]]`”)
   REG_UNONPOSIX:   Not a POSIX RE.
   REG_UUNSPEC:   Contains something not covered by the specification.
   REG_UUNPORT:   RE is formally unportable to different character sets other than the one it was designed for (not a problem in practice; Tcl always uses [UNICODE] characters)
   REG_ULOCALE:   Has a dependency on the locale (only one locale currently supported, so not a problem)
   REG_UEMPTYMATCH:   Can match the empty string.
   REG_UIMPOSSIBLE:   Cannot match anything.
   REG_USHORTEST:   Overall non-greedy regular expression.
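
As a minimal sketch of how the two values returned by -about might be picked apart in a script: the proc names `reAbout` and `reCanMatchEmpty` are hypothetical (not part of Tcl), and Tcl 8.5 or later is assumed for `lassign`, `dict` and the `in` operator. The string argument is passed only to mirror the transcript above.

======
# Hypothetical helper: return the submatch count and the list of
# REG_U* symbols that [regexp -about] reports for a regular expression.
proc reAbout {re} {
    # -about only inspects the RE; the empty string is a dummy
    # argument kept to match the call form used in the transcript.
    lassign [regexp -about -- $re {}] nsub flags
    return [dict create submatches $nsub flags $flags]
}

# Hypothetical helper: true if the RE is flagged as able to match
# the empty string (REG_UEMPTYMATCH).
proc reCanMatchEmpty {re} {
    set flags [dict get [reAbout $re] flags]
    expr {"REG_UEMPTYMATCH" in $flags}
}

# Example calls
puts [dict get [reAbout {(a)(b(c))}] submatches]
puts [reCanMatchEmpty {a{0,3}}]
======

The first call reports the number of parenthesised subexpressions without parsing the RE by hand; going by the transcript above, the second should report true, since `a{0,3}` is flagged REG_UEMPTYMATCH.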
----
If you set up your regular expression in a Tcl variable, then you can have unintended consequences:

 set foo abc(def)
 set RE "$foo"
 regexp $RE $another_variable

Has anybody written a filter for variables that can clean them up before sticking them in a regular expression like this?

''[Lars H]: It appears whoever wrote the above was either confused or made some fatal typo. Besides setting variables foo and RE, the above is 100% equivalent to''

 regexp {abc(def)} $another_variable

----
**Mixing it up**

Mixing greedy and non-greedy quantifiers might not have the results you'd expect. See Henry Spencer's reply in [http://groups.google.com/d/msg/comp.lang.tcl/FddeFPbTFw8/asoMuv7dWqIJ%|%tcl 8.2 regexp not doing non-greedy matching correctly] ,[comp.lang.tcl] ,1999-09-20.

----
[regexp] [Regular Expressions] - [Regular Expression Examples] - [Regular Expression Debugging Tips]

----
**Testing and Debugging REs**

***Visual REGEXP***

[Visual REGEXP] is a little script which helps you to debug your regexp with a "trial and error" method (get it here [http://laurent.riesterer.free.fr/regexp]).

http://www.lucidway.org/Marty/Tcl/TclWikiImages/1345.png (Link "broken" 15 Sep 2005, i.e., target server requires some kind of login.)
Image from Softpedia - http://linux.softpedia.com/: [http://linux.softpedia.com/screenshots/Visual-REGEXP_2.png]

----
***A Test Script***

The following little test script can be used for testing RE's on the fly - [BBH]

======
#
# regexp tester/viewer
#
set SubMatchColors {red blue magenta orange cyan purple green}

proc clear {} {
    # clear old info
    foreach t [.txt tag names] {.txt tag remove $t 1.0 end}
}

proc do_re {} {
    clear
    # get matches by index
    set cmd [list regexp -inline -indices]
    if {$::LINE} {lappend cmd -line}
    if {$::ALL} {lappend cmd -all}
    lappend cmd -- $::EXP [.txt get 1.0 end]
    set l [eval $cmd]
    if {[llength $l] > 0} {
        # mark range of entire match
        set i1 "1.0 + [lindex [lindex $l 0] 0] chars"
        set i2 "1.0 + [expr {[lindex [lindex $l 0] 1] + 1}] chars"
        .txt tag add FullMatch $i1 $i2
        # mark any submatches
        set modval [llength $::SubMatchColors]
        set num 0
        set p2 -1
        foreach {match} [lrange $l 1 end] {
            if {[lindex $match 0] < $p2} {
                # previous match was really a full match when -all specified
                # NOTE: this will also cause the outer set(s) of nested submatches
                # to not be highlighted in any way - an enhancement would
                # be to determine (by parsing the RE itself) how many subexpressions
                # there are, then use that to determine the true "total match"
                # instead of just looking for overlapping ranges, then any
                # nested parens can be formatted (maybe by background, or underline,
                # or italic, or bold, or size or ....) but you would need to
                # determine a set of non-canceling highlights, then keep track
                # of how many levels deep in an overlapping region of text you
                # are in and use a set of modifiers for each level
                #
                # BUT that is too complicated for a simple little test tool
                # (at least for now)
                #
                # Additional NOTE: the -about flag may be of use in determining
                # the number of submatches
                #
                .txt tag add FullMatch "1.0 + $p1 chars" "1.0 + [expr {$p2 + 1}] chars"
                set num [expr {($num - 1) % $modval}]
            }
            set i1 "1.0 + [lindex $match 0] chars"
            set i2 "1.0 + [expr {[lindex $match 1] + 1}] chars"
            .txt tag add SubMatch$num $i1 $i2
            set p1 [lindex $match 0]
            set p2 [lindex $match 1]
            set num [expr {($num + 1) % $modval}]
        }
    } else {
        tk_messageBox -message "RE doesn't match!"
    }
}

wm title . "RE Checker"
label .lbl -text "Expression:"
entry .exp -textvar EXP
# run the match when Return is pressed in the expression entry
bind .exp <Return> do_re
set LINE 0
set ALL 0
frame .f
pack [label .f.label -text "options:"] -side left
pack [checkbutton .f.line -text "-line" -variable LINE] -side left
pack [checkbutton .f.all -text "-all" -variable ALL] -side left
pack [button .f.doit -text "Run regexp!" -command do_re] -side left -expand 1 -fill none
pack [button .f.clear -text "Reset Text" -command clear] -side left -expand 1 -fill none
text .txt -background grey25 -foreground white
.txt tag config FullMatch -background black -relief raised
set i 0
foreach clr $SubMatchColors {
    .txt tag config SubMatch$i -foreground $clr
    incr i
}
grid .lbl .exp -sticky ew
grid .f - -sticky ew -pady 5
grid .txt - -sticky news
grid columnconfigure . 1 -weight 10
grid rowconfigure . 2 -weight 10
======

***Komodo***

Someone ought to explain the RE debugger available in [Komodo]. Also, it would be good to have a comparison with [Visual RegExp].

----
***TREV***

Check out http://www.doulos.com/knowhow/tcltk/examples/trev/ , where [TREV], the Tcl Regular Expression Visualiser, is discussed. Its purpose is to demonstrate how a regular expression matches text.

----
***Regex-coach***

Yet another ("sexier"?)
regexp debugger appears at http://www.weitz.de/regex-coach . Note, though, that it implements ''[Perl]'s'' regexp syntax.

----
***redet***

See [redet] for a tool to assist in developing regular expressions.

----
***txt2regex***

See [^txt2regex$] for a tool to assist in constructing regular expressions.

----
***regexpviewer***

I made my own, as well, located at [regexpviewer] - [davidw]

----
***regfuzz***

[https://code.google.com/p/regfuzz/%|%regfuzz%|%] is a collection of programs and scripts for testing regular expression robustness using randomly generated valid and invalid regular expressions. The base implementation is in C, but a Tcl interface via swig is included along with samples of its use.

<<categories>> Category Debugging | Category String Processing