[ET] 2021-06-07 - (3.14) The following little flex/lex proc was derived from a post on comp.lang.tcl by Christian Gollwitzer.

** Intro to Flask **

Flask is flex in a bottle. It is a function:

   Rules + Data -> Syntax Tree

   Rules:       similar to those in flex or lex, and based on regular expressions
   Data:        a text string, as might be read from a file
   Syntax Tree: a simple 2 level list of lists, with types and the range of the matched text

The function will scan through the '''data''' string and break it into sub-strings by using the text matching '''rules'''. Each sub-string is given a '''type''' code from the rule label. The rules use the powerful Tcl regular expressions and [['''regexp''']], which includes capture groups when using a flask callback [[proc]]. The tree leaves are called '''tokens''' and consist of 3 items in a list.

   Tokens: A 3 element list {type start end}

The '''type''' is the label from the matching rule. The '''start''' and '''end''' indices can be used directly with [[string range '''data''' ...]] to extract the sub-strings.

   * token(start end) => [string range $data $start $end] => text sub-string

The minimal structure of the tree allows for simple grouping or sectioning. In the simplest case, the tree reduces to a single ordered list of tokens to be used with a higher level parser.

Flask can be useful for a quick parse of simple files or those "little languages" that Brian Kernighan once wrote about. Flask is fully dynamic, since the rules are not compiled beforehand. Rules can invoke callbacks, which can extend flask by using the uplevel command. The section of examples below presents some useful techniques.

** What's in the Code Section **

The top of the code section below contains 2 procs: flask and a display tool. After that come some original comments, a sample set of rules, and a data set (a small CAD STEP file). Finally, there are calls to flask and the display tool as a demonstration. The code block can be directly copy/pasted into a console window or a Linux tclsh window to run the demo. For the best demo result, you will want a wide window of about 120 chars.

** Status and Acknowledgments **

Flask is (as of June 2021) just ~90 lines of pure Tcl code. It uses no global or namespace variables and no packages. It's a testament to Tcl that this can be done with so little code. I hope it is the kind of smallish tool that RS would approve of. His code has been a treasure trove of techniques I have used here and elsewhere. And of course, thanks to Christian, who crafted the original version that I had so much fun working with. Thanks for stopping by.

** User Guide **

** Flask **

======
Flask is a functional tool. It has 2 inputs and 1 output. It's based on the lex/flex tools,
and was derived from a post on comp.lang.tcl by Christian Gollwitzer.
======

** Calling flask **

======
Flask takes 2 required arguments and 3 optional ones:

    flask regextokens data {flush yes} {debug no} {indent 3}

1. regextokens

   A list of N x 4 elements, arranged as a matrix with N rows and 4 columns (no limit
   on rows). All 4 columns must be present. The cells of the matrix are described
   below, along with an example.

2. data

   A text string that represents the data to be parsed. If it came from a file, then
   it is simply all the text in the file, as though read in a single
   [read $iochannel] statement.

3. flush

   An optional argument, with a default value of true.
   If there is any extra text beyond the last eos token (the one that terminates the
   scan), it will be flushed into the result as tokens when this is true (the default).

4. debug

   If this is true, then at each token a [puts] is output with info about the token
   type, the regex, the position in the input data, and 20 chars of data. To see the
   exact match, one can also use a puts action with ${$} to show the matched text.

5. indent

   How much to indent the debug lines. Useful if callbacks output other data.

Output

   Flask returns a list of sections. Each section is a list of tokens. And each token
   is a list of 3 elements: an ID and 2 indices. So it's a 3 level structure.
   For example:
======

** Returned Tokens - structure **

======
   { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} }
     -token--  ---token---        ---token---  ----token----
   ---------section----------  -----------section-----------
   ----------------------------return-------------------------

The Token ID is a string describing a token type; the 2 indices, start and end, are
used to extract the token from $data, using [string range $data $start $end].
======

** Flask Rule matrix **

======
The regextokens matrix has 4 columns, as shown below.

       1          2         3          4
    tokenID     Regex     action   a comment

Column 1 is a label to indicate which of the regular expressions in column 2 matched
during the scan of the data. Column 3 is an action to take when a match occurs, and
column 4 is a comment. The comment column is required, but can be just an empty
string. It's part of the matrix (ok, it's really a list) but is not used for anything.
However, be aware that the usual rules for balancing braces etc. need to be considered.
======

** Example Rule Matrix **

======
    tokenID   Regex            action                              a comment
set regextokens {
    WS+C     {[\s,]+}          skip                                "skip over whitespace and commas"
    ID       {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}  "an ID Token and a callback"
    String   {'(''|[^'])*'}    token                               "a string in single quotes"
    Zeugs    {[^;()', ]+}      token                               "stuff"
    LP       {\(}              token                               "left paren"
    RP       {\)}              token                               "Right paren"
    SEMI     {;}               eos+token                           "final result at end of section"
}
======

** Flask processing algorithm **

======
The regular expressions are tried one at a time, from top to bottom, starting at the
first position in the input data text string. When flask finds a match, it looks up
the actions for that RE pattern; after it performs the action and invokes any provided
callback, it shifts the input pointer past the matched text and starts over at the
first rule, looking for another match. This proceeds until there is no more data in
the string OR no match is possible. If the last rule is simply a . then it can be used
as a catch-all rule, and any included callbacks will be executed; often this is a call
to an error routine.

Note: every regular expression has \A prepended, so it's not needed in the rules,
UNLESS the RE begins with metasyntax of the form (?xyz), which must be at the front;
in that case the \A is inserted after the metasyntax.
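For example, a case-insensitive rule can carry its metasyntax up front (a small
sketch; this KEYWORD rule is hypothetical, not part of the sample grammar):

    KEYWORD  {(?i)endsec}  token  "matches ENDSEC, EndSec, endsec, ..."

Internally, flask rewrites the RE to {(?i)\Aendsec}, keeping the metasyntax first.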
The actions can be any one of these words:

    skip       - match the regexp and move the scanning pointer past the token
    token      - match the regexp and create a token, to be found in the result
    eos        - an end of section token, but it is NOT output to the token list
    eos+token  - the same as eos, except a token WILL be output
    new        - begin a new section
    new+token  - begin a new section, THEN output the token

When using the 2 new actions, there will be an empty list element at the front of the
result, which can be ignored using [lrange $result 1 end].

Any other text in the action field will be treated the same as a skip, which can
facilitate commenting out the action. A good choice might be to put a / at the front.
Note, however, that it is really just a skip, and if there is a callback, it WILL be
called.

Any rule may be commented out by having the first column, tokenID, begin with a /,
which means that the RE won't be tested; the whole rule is simply bypassed.

The action can also be a pair of items in a string list. The first must be one of the
above actions, and the second is a callback script. Whatever text is matched can be
accessed using ${$}. Here is an example action with a callback:

    {token {callback ${$} }}

This will output a token for this rule and call the callback routine, passing it the
text that was matched.
======

** Sample Flask Call **

======
set result [flask $regextokens $data] ;# parse and return tokens into result
displaytokens $result $data           ;# debugging printout and an example of traversing the result
======

** Code **

======
# used to debug the output
proc displaytokens {tokens data {dump 1} {indent 0}} {
    set l 0 ;# section number
    set n 0 ;# count of tokens returned; if dump is 0, only count them
    foreach line $tokens {
        incr l
        if { $dump } {
            if { $::tcl_platform(platform) eq "windows" } {
                puts stderr "[string repeat " " $indent]$l $line"
            } else {
                puts "[string repeat " " $indent]$l $line"
            }
        }
        foreach token $line {
            lassign $token id from to
            if { $dump } {
                puts [string repeat " " $indent][format " %-17s -> |%s|" $token [string range $data $from $to]]
            }
            incr n
#           if { [incr count] > 100 } { ;# use this to limit output for a large file
#               return
#           }
#           update
        }
    }
    return $n
}

# description follows, along with an example grammar spec
proc flask {regextokens data {flush yes} {debug no} {indent 3}} { ;# indent is for debug, in case of recursive calls
    # rpos is the running position where to read the next token
    set rpos 0
    set result {}
    set resultline {}
    set eos 0
    set newtokens [list]
    # copy the input rules and prefix a \A to the beginning of each r.e.
    # unless there is metasyntax at the front; then we insert the \A after the meta
    foreach {key RE actionlist comment} $regextokens {
        if {[regexp {\A(\(\?[bceimnpqstwx]+\))(.*)} $RE -> meta pattern]} {
            lappend newtokens $key "$meta\\A$pattern" $actionlist $comment ;# insert the \A after the meta
        } else {
            lappend newtokens $key "\\A$RE" $actionlist $comment
        }
    }
    while true {
        set found false
        foreach {key RE actionlist comment} $newtokens {
            if { [string index $key 0] eq "/" } { ;# comments begin with /
                continue
            }
            if {[regexp -indices -start $rpos $RE $data match cap1 cap2 cap3 cap4 cap5 cap6 cap7 cap8 cap9]} {
                lassign $match start end
                if { $debug } { ;# map newlines to a unicode char; use stderr to colorize the matched portion (windows only)
                    set v1 [string range $data $rpos [expr { $rpos+$end-$start }]]
                    set v2 [string range $data [expr { $rpos+$end-$start+1 }] $rpos+40]
                    regsub -all {\n} $v1 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v1 ;# or 21B2
                    regsub -all {\n} $v2 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v2 ;# or 21B2
                    puts -nonewline [format {%s%-10s %-40s (%4d %4d) |} [string repeat " " $indent] $key $RE $rpos $end]
                    if { $::tcl_platform(platform) eq "windows" } {
                        puts -nonewline stderr "\U2507$v1\U2507"
                    } else {
                        puts -nonewline "\U2507$v1\U2507"
                    }
                    puts "$v2|"
#                   update
                }
                set action [lindex $actionlist 0] ;# if a list: action first, then callback
                if { $action eq "token" } {
                    lappend resultline [list $key {*}$match]
                } elseif {$action eq "eos+token"} {
                    lappend resultline [list $key {*}$match]
                    set eos 1
                } elseif { $action eq "eos" } {
                    set eos 1
                } elseif { $action eq "new+token" } {
                    lappend result $resultline
                    set resultline [list]
                    lappend resultline [list $key {*}$match]
                } elseif { $action eq "new" } {
                    lappend result $resultline
                    set resultline [list]
                }
                if { [llength $actionlist] > 1 } {
                    set callback [lindex $actionlist 1]
                    set $ [string range $data $start $end] ;# the matched text, seen by callbacks as ${$}
                    eval $callback
                }
                set rpos [expr {$end+1}] ;# shift
                set found true
                break
            }
        }
        if {$found} {
            # minimal bottom up parsing
            # for a token designated as eos, end the line/section
            if {$eos} {
                lappend result $resultline
                set resultline {}
                set eos 0
#               puts "end of section"
            }
        } else {
            # nothing matched any longer
            if { $resultline ne {} && $flush} {
                lappend result $resultline
            }
#           puts "Parsing stopped"
            break
        }
    }
    return $result
}

# sample run:
#   tokenID   Regex            action                              a comment
set regextokens {
    WS+C     {[\s,]+}          skip                                "skip over whitespace and commas"
    ID       {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}  "Token Id and a callback"
    String   {'(''|[^'])*'}    token                               "a string in single quotes"
    Zeugs    {[^;()', ]+}      token                               "stuff"
    LP       {\(}              token                               "left paren"
    RP       {\)}              token                               "Right paren"
    SEMI     {;}               eos+token                           "final result at end of section"
}

# sample data to parse, from a STEP file
set data {
ISO-10303-21;
HEADER;
FILE_DESCRIPTION (( 'STEP AP214' ), '1' );
FILE_NAME ('Airp'' lane_ {V1};$ {} [] ; \ ,"-".STEP', '2019-11-26T14:28:03', ( '' ), ( '' ), 'SwSTEP 2.0', 'SolidWorks 2010', '' );
FILE_SCHEMA (( 'AUTOMOTIVE_''DESIGN' ));
ENDSEC;
DATA;
#1 = CARTESIAN_POINT ( 'NONE', ( -3397.537578589738600, -40.70728434983968900, -279.1044191236024400 ) ) ;
#2 = CARTESIAN_POINT ( 'NONE', ( 3983.737298227797500, 1647.263135894628500, 772.3224850880964100 ) ) ;
#3 = CARTESIAN_POINT ( 'NONE', ( -457.2417019049098600, 5240.945876103178300, 87.77828949283561100 ) ) ;
#4 = CARTESIAN_POINT ( 'NONE', ( -1338.327255407125900, -7674.784143274568100, 415.3493082692564800 ) ) ;
ENDSEC;
END-ISO-10303-21;
extra junk at end
}

# ---------- some minimal tests ---
#set data {foobar;baz;} ;#
#set data {foobar}
#set data {}

# ---------- load data file --------
#set io [open d:/smaller.step r]
#set data [read $io]
#close $io

# ---------- run data -------------
set result [flask $regextokens $data yes yes]
displaytokens $result $data
======

** Examples **

Flask can do one level of sectioning. This results in a list of lists. Uses include parsing statements or lines. There are 2 ways to create a section:

   * A rule begins a '''new''' section ('''new''' and '''new+token''') before the token is added
   * A rule '''ends''' the current section ('''eos''' and '''eos+token''') after the token is added

If no new sections are ever created, then all tokens output will be in the first and only section. An [lindex $result 0] is used to get that single list from flask's return tree. Then the program can do its own, more detailed parse on the ordered token stream.

Each matched sub-string can also have an action '''callback'''. The matched text can be accessed with the variable '''${$}'''. A program can also be designed to not output any tokens, but just use callbacks for all matches.

The following trivial data file, a takeoff on the Windows configuration format, will be used with 3 styles of simple parsing, plus one style that uses the capturing groups of an RE. The data to parse is below and is stored in the variable '''data''', which will be used in the examples that follow. It is up to the program to load it however it wishes; this is a feature of flask, since it separates the parsing/scanning from any I/O operations, which is more difficult to do in tools like flex.

This file has sections starting at the '''square brackets''', and the individual items that follow are '''name=value''' pairs. The sections end with a ; and there can be comments with # at the beginning of a line. It's the sort of simple configuration a program might choose over more complex formats such as xml.

======
set data {

[Auto]
# comments here
Updates=1
News=1
;
[Request]
Updates=1
News=1
;
[epp]
Auto=1398
Version=812
File=\\\\tcl8.6.8\\generic\\tclStrToD.c
;
}
======

   * Style 1. A single linear list of tokens

This is the simplest way to use flask. It does not use sectioning and is a true lexical scanner. The output would be used by a parser, in much the way yacc uses lex (or bison uses flex).

======
set regextokens {
    COMMENT  {#[^\n]*}              {skip {puts "comment is '${$}'"}}     " "
    WS       {[\s]+}                skip                                  "skip over whitespace"
    SECT     {\[[a-zA-Z]+\]}        token                                 "a section header"
    ITEM     {[a-zA-Z0-9]+=[^\n]*}  {token {puts "the ITEM was '${$}'"}}  "ITEM Token and a callback"
    SEMI     {;}                    skip                                  ""
} ;# this produces a single list of tokens
======

Comments, whitespace, and the semicolon are parsed, but simply skipped. Running flask as shown below will do puts callbacks to output the matched text. The output using the displaytokens procedure follows.
======
% set result [flask $regextokens $data yes no] ;# flush any extras, no debug
comment is '# comments here'
the ITEM was 'Updates=1'
the ITEM was 'News=1'
the ITEM was 'Updates=1'
the ITEM was 'News=1'
the ITEM was 'Auto=1398'
the ITEM was 'Version=812'
the ITEM was 'File=\\\\tcl8.6.8\\generic\\tclStrToD.c'

% displaytokens $result $data
1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40} {SECT 44 52} {ITEM 54 62} {ITEM 64 69} {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
 SECT 2 7          -> |[Auto]|
 ITEM 25 33        -> |Updates=1|
 ITEM 35 40        -> |News=1|
 SECT 44 52        -> |[Request]|
 ITEM 54 62        -> |Updates=1|
 ITEM 64 69        -> |News=1|
 SECT 73 77        -> |[epp]|
 ITEM 79 87        -> |Auto=1398|
 ITEM 89 99        -> |Version=812|
 ITEM 101 139      -> |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|
======

The 1 above is the section number; this single section holds all 10 tokens.

======
% llength $result
1
% llength [lindex $result 0]
10
======

   * Style 2. Using an end of section ('''eos''')

This method takes advantage of the semicolons in the data as end of section indicators. The only difference from the previous example is that the SEMI rule action is '''eos'''. This causes it to finish off the current sub-list at each semicolon and begin a new empty one.

======
set regextokens {
    COMMENT  {#[^\n]*}              {skip {puts "comment is '${$}'"}}     " "
    WS       {[\s]+}                skip                                  "skip over whitespace"
    SECT     {\[[a-zA-Z]+\]}        token                                 "a section header"
    ITEM     {[a-zA-Z0-9]+=[^\n]*}  {token {puts "the ITEM was '${$}'"}}  "ITEM Token and a callback"
    SEMI     {;}                    eos                                   "semi colon for end"
} ;# this produces 2 levels, sections and lines, by using an end of section token, a SEMI
======

And the output:

======
% set result [flask $regextokens $data yes no] ;# flush any extras, no debug
... same as above ...

% displaytokens $result $data
1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
 SECT 2 7          -> |[Auto]|
 ITEM 25 33        -> |Updates=1|
 ITEM 35 40        -> |News=1|
2 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
 SECT 44 52        -> |[Request]|
 ITEM 54 62        -> |Updates=1|
 ITEM 64 69        -> |News=1|
3 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
 SECT 73 77        -> |[epp]|
 ITEM 79 87        -> |Auto=1398|
 ITEM 89 99        -> |Version=812|
 ITEM 101 139      -> |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|
% llength $result
3
% llength [lindex $result 0]
3
% llength [lindex $result 1]
3
% llength [lindex $result 2]
4
======

   * Style 3. Using '''new''' sections

This method would be used if there were no semicolons to indicate the end of a section, but rather some token that begins a new section. The '''new''' and '''new+token''' actions are for this purpose. When this method is used, there will always be one null section at the very beginning of the output. This can be removed using a [lrange $result 1 end] or just ignored. There are other ways to get around this, but they are likely not worth the trouble; see, for example, the use of state variables discussed in the advanced section.

======
set regextokens {
    COMMENT  {#[^\n]*}              {skip {puts "comment is '${$}'"}}        " "
    WS       {[\s]+}                skip                                     "skip over whitespace"
    SECT     {\[[a-zA-Z]+\]}        {new+token {puts "Section was '${$}'"}}  "a section header"
    ITEM     {[a-zA-Z0-9]+=[^\n]*}  {token {puts "   ITEM was '${$}'"}}      "ITEM Token and a callback"
    SEMI     {;}                    skip                                     "semi colon for end"
} ;# this also produces 2 levels, but starts a new section on the SECT token, with an extra null section at the beginning
======

And the results:

======
% set result [flask $regextokens $data yes no] ;# flush any extras, no debug
... same as above ...
% displaytokens $result $data
1
2 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
 SECT 2 7          -> |[Auto]|
 ITEM 25 33        -> |Updates=1|
 ITEM 35 40        -> |News=1|
3 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
 SECT 44 52        -> |[Request]|
 ITEM 54 62        -> |Updates=1|
 ITEM 64 69        -> |News=1|
4 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
 SECT 73 77        -> |[epp]|
 ITEM 79 87        -> |Auto=1398|
 ITEM 89 99        -> |Version=812|
 ITEM 101 139      -> |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|
% llength $result
4
% llength [lindex $result 0]
0
% llength [lindex $result 1]
3
% llength [lindex $result 2]
3
% llength [lindex $result 3]
4
======

If the section starting tokens aren't required, say because they are simply constant text like "begin" or "data", one can use the action '''new''' instead of '''new+token''' to have them not included at the beginning of the sections.

   * Style 4. Using callbacks

The use of capture groups with callbacks is often easier than trying to write a rule for each of the pieces and having flask build a token for each match. It is most useful when you need to describe a rule that has several components that must all be present. In a sense, this provides a bit of higher level parsing from a lexical scanner.

   * Minimal parsing considered harmful

Flex does not seem to support capture groups. A FAQ in the Flex manual actually frowns on having a scanner do something that looks a bit like parsing, saying one should use a true parser of the yacc class instead. That is always possible, since flask can be used as a simple tokenizer, and the parser would then use a single, non-sectioned token list.

   * Multiple alternatives and longest alternative match

With the | operator, capture groups can be used to indicate which alternative matched. In that case, Tcl REs will choose the longest match (or the first, left to right, of equals). This can simulate how flex does its parallel matching. Flask has an extra level of control, since it can also assign priority by grouping alternatives in separate rules.

   * Capture groups and callbacks

The next example presents the grammar rules for our sample config file, using a callback that further parses the matched text and then injects tokens into the token stream itself. This example also demonstrates a method for handling syntax errors. If flask cannot find a match at some point in the file, it will return as though it had reached the end of the data. To handle that case, one can provide a catch-all rule that matches anything and throws an error in the callback. Here it will match any characters and spit out up to 10 chars of context. Notice the call to do_item in the ITEM rule; the callback below demonstrates using RE capture groups.

======
set regextokens {
    COMMENT  {#[^\n]*}                       {skip {#puts "comment is '${$}'"}}        " "
    WS       {[\s]+}                         skip                                      "skip over whitespace"
    SECT     {\[[a-zA-Z]+\]}                 {new+token {#puts "Section was '${$}'"}}  "a section header"
    ITEM     {([a-zA-Z0-9]+)(\s*=)?([^\n]*)} {skip {do_item ${$}}}                     "var=value to parse in callback"
    SEMI     {;}                             skip                                      "semi colon for end"
    ERROR    {.{1,10}}                       {skip {error "error at '${$}'"}}          "catch all is an error if we get here"
} ;# this also produces 2 levels, starting a new section on each SECT token, with an extra null section at the beginning
======

This method with capture groups makes it easier to split apart portions of a matched text. It uses the uplevel command to retrieve the capture information, and then it builds tokens to inject into the stream. See the advanced section for details on these cap* variables and others.
In this method, all the rules except SECT use the skip action, and all the other tokens are created from the callbacks.

======
proc do_item {item} { ;# proof of concept: further parse the p=v using capture groups
    set cap1 [uplevel {set cap1}] ;# capture group 1 is the label
    set cap3 [uplevel {set cap3}] ;# capture group 3 is everything after the first =
#   puts "cap1= |$cap1| cap3= |$cap3| "
    uplevel lappend resultline [list [list parm  {*}$cap1]] ;# standard method for injecting a token
    uplevel lappend resultline [list [list value {*}$cap3]] ;# note the 2 uses of the list command
}

# This method uses capturing groups, and the code uses cap1 and cap3. The cap* variables have the
# right indices and so make it much easier. The number of cap groups has now been increased to 9.
# The rule uses a lazy quantifier on white-space after a label, if present, up to (but only 1) equal sign.
# Note: if a capture group would capture a null string (say using a* and there were no a's) then
# regexp returns a start value at the point where the string would have started, with an end
# index 1 less than the start value. When doing a string range, this will return a null string,
# so all is good.
======

Here's what that looks like. Notice the difference: in the earlier examples, without the cap* groups, the entire line was matched and produced only a single token. In that case, the program using the tokens would have to do some string manipulation or its own splitting. With this method, the tokens are easily traversed, two at a time for each pair. Either way would work; it's a matter of taste.

======
1
2 {SECT 2 7} {parm 25 31} {value 33 33} {parm 35 38} {value 40 40}
 SECT 2 7          -> |[Auto]|
 parm 25 31        -> |Updates|
 value 33 33       -> |1|
 parm 35 38        -> |News|
 value 40 40       -> |1|
3 {SECT 44 52} {parm 55 61} {value 63 63} {parm 66 69} {value 71 71}
 SECT 44 52        -> |[Request]|
 parm 55 61        -> |Updates|
 value 63 63       -> |1|
 parm 66 69        -> |News|
 value 71 71       -> |1|
4 {SECT 75 79} {parm 81 84} {value 86 89} {parm 91 97} {value 99 101} {parm 103 106} {value 108 142}
 SECT 75 79        -> |[epp]|
 parm 81 84        -> |Auto|
 value 86 89       -> |1398|
 parm 91 97        -> |Version|
 value 99 101      -> |812|
 parm 103 106      -> |File|
 value 108 142     -> |\\\\tcl8.6.8\\generic\\tclStrToD.c|
======

** Advanced topics and Callbacks **

======
Callbacks and flask local variables.

Typically a callback will be used to invoke a procedure. When this proc is called, it is running
1 level below flask, and so with a very simple use of [uplevel], all the local variables of flask
can be accessed from the callback. Here are some things you can do in a callback.
Get a list of all the local variables in flask from the callback:

# ---------------
set info [uplevel {info local}]
# ---------------

Retrieve the values of some of them and write them to the screen with puts:

# ---------------
foreach v {rpos key RE actionlist comment match start end cap1 cap2 cap3 cap4} { ;# retrieve the most useful
    set val [uplevel "set $v"] ;# get the value of each of the above variables
    puts [format {   %-10s = %s} $v |$val| ]
}
set data [uplevel {set data}] ;# get the input data, but don't list it, too big
# ---------------

To access individual variables, one can use a set statement like so, to get a variable's value
and give each the same name locally:

# ---------------
set start [uplevel {set start}] ;# get start and store it in a callback local of the same name
set end   [uplevel {set end}]   ;# get end and do the same
# ---------------

Here are the variables that can be accessed and their uses:

# --------------------------------------------------------------------------------
data         The string variable, unchanging, that was passed into flask
rpos         The running position (as an index) into data

key          These are the 4 columns of the current rule that has matched
RE
actionlist
comment

match        The current match, a pair of indices; a string range on this pair will access the matched data
start        The first index in match
end          The second index in match

cap1         If there are any capturing groups, up to 9 are found in these 9 variables,
cap2 .. cap9 with {-1 -1} for the ones not set

The above are mostly for retrieving information about the current match, while the following
are used by flask to build the token lists. These can be used to inject tokens into the
stream from a callback:

resultline   The current section's list of tokens (originally a section was a line)
result       The list of lists; at each section end, resultline is appended to result
             and resultline is cleared
# --------------------------------------------------------------------------------

When there are capturing groups in the regex (portions of the RE in ()'s), the indices of each
capture will be found in cap1..cap9. The groups that were not present in the RE are assigned
the value {-1 -1}.

A callback can add tokens to the list that is being built up. A token is always a 3 element
list comprised of a {type start end} triad. The 2 indices in the cap* variables are in a list
of their own, as a {start end} pair. To build a new token from a pair of indices and give it
the type "mytoken", one could do this with the first capture group:

# ---------------
# note the need here for [list [list ...]]; this is because uplevel will use concat, which undoes the outer one
uplevel lappend resultline [list [list mytoken {*}$cap1]] ;# add a token {mytoken cap-start cap-end}
# ---------------

Note that to be useful, the indices should be relative to the text found in the data variable.
If one does some parsing of the matched string, say with a [string first] or [regexp] statement
that returns indices, then one would typically need to add the offset found in the start
variable when using them to build a token.

After a match, and AFTER the callback is made, the variable rpos is updated to point to the
next position in the text (in data). This is done as follows:

    set rpos [expr {$end+1}] ;# shift

So it is possible to modify rpos indirectly from the callback, by knowing it will be updated
(immediately) after the return is made from the callback. This can be accomplished by
modifying the variable "end" before returning from the callback.
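For example, a callback could make flask resume scanning at the next semicolon by pushing
"end" forward (a hypothetical sketch; skipto is not part of flask):

proc skipto {arg} {
    set data [uplevel {set data}]          ;# the full input text
    set end  [uplevel {set end}]           ;# end index of the current match
    set semi [string first ";" $data $end] ;# find the next semicolon
    if { $semi >= 0 } {
        uplevel [list set end [expr { $semi - 1 }]] ;# rpos becomes $semi right after we return
    }
}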
# ---------------
Starting a new section
# ---------------

uplevel {lappend result $resultline} ;# append the current section list to the result
uplevel {set resultline {}}          ;# clear the current section

Note: this could result in an empty section, depending on the length of resultline.

# ---------------
Rejecting a match
# ---------------

A callback can also return with a [return -code continue], which will cause rpos to not be
updated and the match to be rejected, so the foreach loop that is iterating over the REs will
continue instead of starting over at the top. For example, suppose one has this rule at the top:

    test {^test(.*?)\n} {skip {do_test ${$}}} "testing a continue"

and do_test is this:

proc do_test {arg} {
    set cap1  [uplevel {set cap1}] ;# retrieve the first capture group
    set start [lindex $cap1 0]
    set end   [lindex $cap1 1]
    puts "found test with |$arg| and cap = |$cap1|"
    if { ($end - $start + 1) > 5} {
        return -code continue
    }
    return
}

In this example, the callback retrieves the current match's capture group 1 indices (start end)
and checks for a length greater than 5; if so, it rejects the match.

# ---------------
Saving state
# ---------------

While flask does not directly support flex/lex start conditions, these can be implemented by
saving some state inside flask (or, if you don't like that, use global variables etc.)

proc do_count {args} {
    set counter [uplevel {incr counter}] ;# will set it to 1 if it does not exist yet
    if { $counter <= 1 } {
        puts stderr "what to do first time only = $counter"
    } else {
        puts stderr "what to do 2..nth time here = $counter"
    }
}

The above could be used in a callback to detect whether this rule has been matched before, and
perhaps ignore it or do something different. One could also optionally do a [return -code continue]
to let another rule take a shot at it.

There's no end of possibilities here, but you must be sure you are not modifying a flask variable
unintentionally. Tcl presents many interesting capabilities here that few other languages have
(and none others that I know of). Have fun with this!

And of course, all of this assumes that flask itself will not be modified, so all bets are off if
one does that. On the other hand, it's a small proc and the source code is provided.
======

** Debugging **

Flask has a tracing feature that is useful for debugging your rules. It's turned on with the 4th parameter to the flask proc. For example,

   set result [flask $rules $data yes yes 3]

When calling flask, the final 3 parameters are optional; to set the 4th and 5th, you must also supply a value for the 3rd. The 5th parameter provides a way to indent the output, to make it easier to separate any other output that might come from callbacks. It defaults to 3. The included displaytokens also now has an indent parameter.
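For instance, to keep the trace and the token dump aligned, one might pair the two indents (a minimal sketch; $rules and $data stand for whatever rule matrix and input you are testing):

======
set result [flask $rules $data yes yes 3] ;# flush extras, debug tracing on, indent 3
displaytokens $result $data 1 3           ;# dump the tokens with a matching indent
======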
Here is a sample display of the output:

======
   WS         \A[\s]+                                  (   0    1) |┊⤶⤶┊[Auto]⤶# comments here that are very lo|
   SECT       \A\[[a-zA-Z]+\]                          (   2    7) |┊[Auto]┊⤶# comments here that are very long|
   WS         \A[\s]+                                  (   8    8) |┊⤶┊# comments here that are very long indee|
   COMMENT    \A#[^\n]*                                (   9   72) |┊# comments here that are very long indeed will go past the limit┊|
comment is '# comments here that are very long indeed will go past the limit'
   WS         \A[\s]+                                  (  73   73) |┊⤶┊Updates=2⤶News=2⤶;⤶[Request]⤶ Updates=1⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  74   82) |┊Updates=2┊⤶News=2⤶;⤶[Request]⤶ Updates=1⤶ |
   WS         \A[\s]+                                  (  83   83) |┊⤶┊News=2⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  84   89) |┊News=2┊⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[|
   WS         \A[\s]+                                  (  90   90) |┊⤶┊;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[epp]⤶A|
======

The first column is the token code, then the regular expression (notice the \A has been added automatically), followed by the '''start''' and '''end''' indices of the matched data. Next, inside vertical bars, is the text where the rule started its match; the match itself is additionally bracketed by a pair of unicode 4 piece vertical bars. With the windows console (or when enabled on linux, using the wiki code here [https://wiki.tcl-lang.org/page/console+for+Unix]), those 2 bars plus the text that matched will be colored red. All newlines are mapped to a unicode char for better visibility of the text.

All rules will be output, even if they don't produce tokens. And any callbacks that output text will be interspersed as well. Of course, since you have the source code, you can just find the [format] statements and change the column sizes if you want some other values.

** Extra - longest match RE with alternatives **

The tools flex and lex do pattern matching amongst the matching rules in parallel. If multiple rules can match at some point in the data, then the '''longest''' one is chosen (or the first of equal sized matches). Flask is different, as it tries its rules one at a time and the '''first''' match wins, not the longest. However, Tcl regexes that use the '''alternation operator |''' will choose the '''longest''' match. So, with a callback, flask can produce the same results, although likely not as fast.

Below we have some data, some rules, and a callback. Notice that the action is skip for both rules, so only tokens inserted from the callback will appear in the result. Since the RE is a set of 4 alternatives, where each is in ()'s, regexp will assign a value (2 indices in a list) to each of 4 variables: cap1..cap4. All but 1 will be {-1 -1}, and the other one is the longest match.

The callback retrieves cap1..cap4 into a local array variable cap() and stops on the one that is not {-1 -1}. It then creates a corresponding token type '''type#''', depending on which match occurred. The callback then injects a token of that type, with the capture indices, into the token stream. The example outputs some debug information, including a dump of the cap array using [parray].
======
# (flask and displaytokens procs inserted here, or paste the whole code section above first)

# -------- sample data ------
set data {
 aaaaax
xyaaz
 abbbbb
}

# -------- flask rules ------
set regextokens {
    test   {(a+)|\A(ab*)|\A(aa)|\A(a+x)}  {skip {do_longest ${$}}}  "test for longest of 4 alternatives"
    skippy {.}                            skip                      "catch all anything else and toss it"
}

# -------- callback -------
proc do_longest {mat} {
    puts "match= |$mat| "
    foreach c {cap1 cap2 cap3 cap4} {
        incr group ;# 1..4
        set cap($group) [uplevel set $c]
        if { $cap($group) ne {-1 -1} } {
            puts "longest @ $group = $cap($group)"
            break
        }
    }
    parray cap
    set type    type$group   ;# token of type type#
    set indices $cap($group) ;# pair of indices
    puts "type= |$type| indices= |$indices| "
    uplevel lappend resultline [list [list $type {*}$indices]] ;# insert a token for the longest
}

# now flask it and dump the token list
set result [flask $regextokens $data yes yes 30]
puts "\n---------------------------"
displaytokens $result $data
======

Here is the output of this program. Note the use of the indent for the debug (30), to make it easier to separate the debug output from the callback output.

======
                              skippy     \A.                                      (   0    0) |┊⤶┊ aaaaax⤶xyaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (   1    1) |┊ ┊aaaaax⤶xyaaz⤶ abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (   2    7) |┊aaaaax┊⤶xyaaz⤶ abbbbb⤶|
match= |aaaaax|
longest @ 4 = 2 7
cap(1) = -1 -1
cap(2) = -1 -1
cap(3) = -1 -1
cap(4) = 2 7
type= |type4| indices= |2 7|
                              skippy     \A.                                      (   8    8) |┊⤶┊xyaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (   9    9) |┊x┊yaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (  10   10) |┊y┊aaz⤶ abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (  11   12) |┊aa┊z⤶ abbbbb⤶|
match= |aa|
longest @ 1 = 11 12
cap(1) = 11 12
type= |type1| indices= |11 12|
                              skippy     \A.                                      (  13   13) |┊z┊⤶ abbbbb⤶|
                              skippy     \A.                                      (  14   14) |┊⤶┊ abbbbb⤶|
                              skippy     \A.                                      (  15   15) |┊ ┊abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (  16   21) |┊abbbbb┊⤶|
match= |abbbbb|
longest @ 2 = 16 21
cap(1) = -1 -1
cap(2) = 16 21
type= |type2| indices= |16 21|
                              skippy     \A.                                      (  22   22) |┊⤶┊|

---------------------------
1 {type4 2 7} {type1 11 12} {type2 16 21}
 type4 2 7         -> |aaaaax|
 type1 11 12       -> |aa|
 type2 16 21       -> |abbbbb|
======

** Extra - xml with verify + pretty print **

Here we have a trivial xml file. When we find the ?xml tag, we look at the rest of it and parse it with a recursive call to flask. When we get the results from the "inner" flask, we use a relative to absolute index function that also prepends --- to the front of the token type, for a) better visibility in this demo, and b) they can be handy for a program that is walking the list, as they can be used to tell that one is past the parameters without needing to count them first in some other way.
======
set data {

<?xml version = "1.0" encoding = "UTF-8" ?>
<!-- a comment, skipped but can be a token -->
<class_list>
   <student>
      <name>Tanmay</name>
      <grade>A</grade>
   </student>
</class_list>
}

set rules {
    WS      {\s+}          skip                     "whitespace"
    comment {<!--.*?-->}   xtoken                   "comments, for debugging useful to token it"
    tag     {(?i)<[^>]*>}  {token {tagback ${$} }}  "any kind of tag, case insensitive"
    -data   {[^\<\>]+}     token                    "any stuff between tags"
    error   {.{1,50}}      {skip {update;error "error at ${$}"} }  "w/o update, might not see debug data"
}

proc rel2abs {rlist start} { ;# map a relative list of tokens -> an absolute list by adding start
    set toks [lindex $rlist 0] ;# get the list in the first and only section
    set newlist {}
    foreach tok $toks { ;# convert each token
        lassign $tok type from to
        lappend newlist [list ---$type [expr { $from + $start }] [expr { $to + $start }] ] ;# to absolute; add --- for visibility
    }
    return $newlist
}

proc tagback2 {arg} {
    set cap1 [uplevel set cap1]
    set cap2 [uplevel set cap2]
    uplevel lappend resultline [list [list parm  {*}$cap1]] ;# standard method for injecting a token
    uplevel lappend resultline [list [list value {*}$cap2]] ;# note the 2 uses of the list command
}

proc tagback {arg} { ;# found a tag, see if it's one with extra parameters
    if { [string range $arg 0 4] eq "<?xml" } {
        set start [expr { [uplevel {set start}] + 5 }] ;# absolute position of the text after the <?xml
        set data  [string range $arg 5 end-2]          ;# strip off the <?xml and the ?> at the end, don't need em
        set rules {
            ws {\s+}                       skip                     "The ID below uses 2 capture groups"
            ID {([\w]+)\s*=\s*"([^\"]*?)"} {skip {tagback2 ${$} } } {foo = "bar"   foo->cap1  bar->cap2}
        }
        set result [flask $rules $data yes no 20] ;# parse the list of values recursively, no debug
        foreach atok [rel2abs $result $start] { ;# add each separately, after adding start to convert to abs
            uplevel lappend resultline [list [list {*}$atok]] ;# note the 2 uses of the list command on the entire token
        }
        return 1 ;# indicate we found something to further parse
    }
}

set result [flask $rules $data]
puts [displaytokens $result $data] ;# returns the number of tokens in total
======

with this result:

======
1 {tag 2 44} {---parm 8 14} {---value 19 21} {---parm 24 31} {---value 36 40} {tag 93 104} {tag 109 117} {tag 125 130} {-data 131 136} {tag 137 143} {tag 151 157} {-data 158 158} {tag 159 166} {tag 171 180} {tag 182 194}
 tag 2 44          -> |<?xml version = "1.0" encoding = "UTF-8" ?>|
 ---parm 8 14      -> |version|
 ---value 19 21    -> |1.0|
 ---parm 24 31     -> |encoding|
 ---value 36 40    -> |UTF-8|
 tag 93 104        -> |<class_list>|
 tag 109 117       -> |<student>|
 tag 125 130       -> |<name>|
 -data 131 136     -> |Tanmay|
 tag 137 143       -> |</name>|
 tag 151 157       -> |<grade>|
 -data 158 158     -> |A|
 tag 159 166       -> |</grade>|
 tag 171 180       -> |</student>|
 tag 182 194       -> |</class_list>|
15
======

The following is a modified version of the above, but with additional processing for several tags that contain additional information. It operates on an actual xml file that my video editor (VideoRedo) generates as a project file. It too only has values in quotes, so it's rather easy to parse. The file has several sections, for cutlists, scenes, and chapters. The final tag could also have been used to output 1 more section, but that rule was commented out to give an example of doing so.
======
proc tagback {arg} { ;# found a tag, see if it's one with extra parameters
    if { [string range $arg 0 4] eq "<?xml" || [string first {="} $arg] > 0 } { ;# also parse ordinary tags with parm="value" pairs
        set blank [string first " " $arg]                   ;# skip past the tag name itself
        set start [expr { [uplevel {set start}] + $blank }] ;# absolute position of the parameter text
        set data  [string range $arg $blank end-1]          ;# strip the name and the trailing >, don't need em
        set rules {
            ws {\s+}                        skip                      "The ID below uses 2 capture groups"
            ID {([\w]+)\s*=\s*\"([^\"]*?)"} {token {tagback2 ${$} } } {foo = "bar"   foo->cap1  bar->cap2}
        }
        set result [flask $rules $data yes no 20] ;# parse the list of values recursively
#       puts "result= |$result| "
        foreach atok [rel2abs $result $start] { ;# add each separately, after adding start to convert to abs
            uplevel lappend resultline [list [list {*}$atok]] ;# note the 2 uses of the list command on the entire token
        }
#       displaytokens $result $data 1 30 ;# for debug, we now have an indent and a token counter
        return 1 ;# indicate we found something to further parse
    }
}

set data {
<!-- VideoRedo project file, trimmed -->
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<VideoReDoProject>
<VideoReDoVersion>5.4.84.771 - Sep 24 2018</VideoReDoVersion>
<Filename>A:\files\dl\NASA -_Facts_And_Conse_RNgsN2JMI_Q.mp4</Filename>
<Description></Description>
<StreamType>4</StreamType>
<Duration>15356000000</Duration>
<SyncAdjustment>0</SyncAdjustment>
<AudioVolumeAdjust>1.000000</AudioVolumeAdjust>
<CutMode>1</CutMode>
<VideoStreamPID>513</VideoStreamPID>
<AudioStreamPID>514</AudioStreamPID>
<ProjectTime>12694000000</ProjectTime>
<CutList>
<cut Sequence="1" CutStart="00:02:54;18" CutEnd="00:07:20;24" Elapsed="00:02:54;18">
<CutTimeStart>1747200000</CutTimeStart>
<CutTimeEnd>4409600000</CutTimeEnd>
<CutByteStart>5714975</CutByteStart>
<CutByteEnd>11913366</CutByteEnd>
</cut>
</CutList>
<SceneList>
<SceneMarker Sequence="1" Timecode="00:02:54;18">1747200000</SceneMarker>
<SceneMarker Sequence="2" Timecode="00:07:20;24">4409600000</SceneMarker>
</SceneList>
<ChapterList>
<ChapterMarker Sequence="1" Timecode="00:00:00;00">0</ChapterMarker>
<ChapterMarker Sequence="2" Timecode="00:02:54;18">1747200000</ChapterMarker>
</ChapterList>
</VideoReDoProject>
}

set rules {
    WS      {\s+}                 skip       "whitespace"
    comment {<!--.*?-->}          /token     "comments, for debugging useful to token it"
    /tag    {</VideoReDoProject>} new+token  "Special tags start a new section"
    tag     {<CutList>}           new+token  ""
    tag     {<SceneList>}         new+token  ""
    tag     {<ChapterList>}       new+token  ""
    tag     {(?i)<[^>]*>}         {token {tagback ${$} }}  "any kind of tag, case insensitive"
    -data   {[^\<\>]+}            token      "any stuff between tags"
    error   {.{1,50}}             {skip {update;error "error at ${$}"} }  "w/o update, might not see debug data"
}
======

Here is only part of the output, since it's rather large. What you can see is that it added some sectioning, which divides the cut lists from the scene lists. The commented out '''/tag''' rule would have added one more, but it is here to demonstrate the '''comment / character''' on the rule id name. Also, the '''comment''' rule has its action commented out, but the rule is still processed; a commented out action is the same as a skip. Note, however, that had there been a callback, it would have been executed. To comment out a callback, one uses the standard # of Tcl, since a callback is a Tcl script.

The time to parse that data was 2.5 ms, for a total of 97 tokens generated.

Notice that multiple rules create the same tag token. This lets us do some extra actions (create some sections) but still return a tag token. However, each such rule must match the entire tag, or there would be something left over and it would not work correctly.

======
time-2,525.000 microseconds per iteration
1 {tag 42 96} {---ID 48 60} {---parm 48 54} {---value 57 59} {---ID 62 77} {---parm 62 69} {---value 72 76} {---ID 79 94} {---parm 79 88} {---value 91 93} {tag 97 126} {tag 129 164} {-data 165 188} {tag 189 207} {tag 209 218} {-data 219 269} {tag 270 280} {tag 282 294} {tag 295 308} {tag 309 320} {-data 321 321} {tag 322 334} {tag 335 344} {-data 345 355} {tag 356 366} {tag 368 383} {-data 384 384} {tag 385 401} {tag 402 420} {-data 421 428} {tag 429 448} {tag 450 458} {-data 459 459} {tag 460 469} {tag 470 485} {-data 486 488} {tag 489 505} {tag 506 521} {-data 522 524} {tag 525 541} {tag 543 555} {-data 556 566} {tag 567 580}
 tag 42 96         -> |<?xml version="1.0" encoding="UTF-8" standalone="yes"?>|
 ---ID 48 60       -> |version="1.0"|
 ---parm 48 54     -> |version|
 ---value 57 59    -> |1.0|
 ---ID 62 77       -> |encoding="UTF-8"|
 ---parm 62 69     -> |encoding|
 ---value 72 76    -> |UTF-8|
 ---ID 79 94       -> |standalone="yes"|
 ---parm 79 88     -> |standalone|
 ---value 91 93    -> |yes|
 tag 97 126        -> |<VideoReDoProject ...>|
 tag 129 164       -> |<VideoReDoVersion ...>|
 -data 165 188     -> |5.4.84.771 - Sep 24 2018|
 tag 189 207       -> |</VideoReDoVersion>|
 ... snip ...
 tag 812 824       -> |</CutByteEnd>|
 tag 826 831       -> |</cut>|
 tag 833 842       -> |</CutList>|
3 {tag 846 856} {tag 861 909} {---ID 874 885} {---parm 874 881} {---value 884 884} {---ID 887 908} {---parm 887 894} {---value 897 907} {-data 910 919} {tag 920 933} {tag 936 984} {---ID 949 960} {---parm 949 956} {---value 959 959} {---ID 962 983} {---parm 962 969} {---value 972 982} {-data 985 994} {tag 995 1008} {tag 1010 1021}
 tag 846 856       -> |<SceneList>|
 tag 861 909       -> |<SceneMarker Sequence="1" Timecode="00:02:54;18">|
 ---ID 874 885     -> |Sequence="1"|
 ---parm 874 881   -> |Sequence|
 ---value 884 884  -> |1|
 ---ID 887 908     -> |Timecode="00:02:54;18"|
 ---parm 887 894   -> |Timecode|
 ---value 897 907  -> |00:02:54;18|
 -data 910 919     -> |1747200000|
 tag 920 933       -> |</SceneMarker>|
 tag 936 984       -> |<SceneMarker Sequence="2" Timecode="00:07:20;24">|
 ---ID 949 960     -> |Sequence="2"|
 ---parm 949 956   -> |Sequence|
 ---value 959 959  -> |2|
 ---ID 962 983     -> |Timecode="00:07:20;24"|
 ---parm 962 969   -> |Timecode|
 ---value 972 982  -> |00:07:20;24|
 -data 985 994     -> |4409600000|
 tag 995 1008      -> |</SceneMarker>|
 tag 1010 1021     -> |</SceneList>|
4 {tag 1023 1035} {tag 1040 1090} {---ID 1055 1066} {---parm 1055 1062} {---value 1065 1065} {---ID 1068 1089} {---parm 1068 1075} {---value 1078 1088} {-data 1091 1091} {tag 1092 1107} {tag 1109 1159} {---ID 1124 1135} {---parm 1124 1131} {---value 1134 1134} {---ID 1137 1158} {---parm 1137 1144} {---value 1147 1157} {-data 1160 1169} {tag 1170 1185} {tag 1187 1200} {tag 1201 1219}
 tag 1023 1035     -> |<ChapterList>|
 tag 1040 1090     -> |<ChapterMarker Sequence="1" Timecode="00:00:00;00">|
 ---ID 1055 1066   -> |Sequence="1"|
 ---parm 1055 1062 -> |Sequence|
 ... snip ...
======

   * Post processing

Now that we have our token list, what can we do with it? Here's some code to walk the token lists and check that every tag matches with a closing tag.

======
# some post processing
# first, just get all the tag tokens into a single list, tags, for extra convenience
set tags {}
foreach section $result {
    foreach tok $section {
        if { [lindex $tok 0] eq "tag" } { ;# only grab the tokens with type tag
            lappend tags $tok
        }
    }
}
set tags [lrange $tags 1 end] ;# trim off the ?xml, it has no match at the end

# now use a stack with the linear list of just the tag tokens
set badxml no
set stack {anchor}
foreach tag $tags {
    lassign $tag type start end
    set tagkey [lindex [split [string range $data $start+1 $end-1] " "] 0] ;# split on space -> list of 1 or more
    if { [string index $tagkey 0] eq "/" } { ;# so we can check if it's a closing tag
        if { [string range $tagkey 1 end] ne [lindex $stack end] } { ;# compare top of stack with our closing
            puts stderr "mismatch with $tagkey : [lindex $stack end]" ;# tag with its / removed
            set badxml yes
            break
        }
        set stack [lrange $stack 0 end-1] ;# pop
    } else {
        lappend stack $tagkey ;# push
    }
}
if { [llength $stack] != 1 || $stack != {anchor} || $badxml} {
    puts stderr "Tags malformed"
} else {
    puts "Tags balance"
}
======

A pretty print of the videoredo xml:

======
set stack {}
foreach section $result {
    foreach tok $section {
        lassign $tok type start end
        set indent [string repeat " | " [expr { [llength $stack] - 1 }] ]
        if { $type eq "tag" } {
            set tagkey [lindex [split [string range $data $start $end] " "] 0] ;# split on space -> list of 1 or more
            if { [string index $tagkey end] eq ">" } {
                set tagkey [string range $tagkey 0 end-1] ;# strip the trailing > on tags with no attributes
            }
            set tagkey [string range $tagkey 1 end]
            if { [string index $tagkey 0] eq "/" } {
                set stack [lrange $stack 0 end-1] ;# pop
                set indent [string repeat " | " [expr { [llength $stack] - 1 }] ]
#               puts "$indent $tagkey" ;# optionally output the /end tag if desired, but compute indent after the stack pop
            } else {
                puts "$indent $tagkey"
                lappend stack $tagkey ;# push
            }
        } elseif { $type eq "---parm"} {
            set parmtext [string range $data {*}[lrange $tok 1 2]]
            puts "$indent $parmtext"
        } elseif {$type eq "---value" } {
            set valutext [string range $data {*}[lrange $tok 1 2]]
            puts "$indent = $valutext"
        } elseif {$type eq "-data" } {
            set datatext [string range $data {*}[lrange $tok 1 2]]
            puts "$indent '$datatext'"
        } else {
            puts stderr "should not happen"
        }
    }
}
======

outputting:

======
 ?xml
 version
 = 1.0
 encoding
 = UTF-8
 standalone
 = yes
 VideoReDoProject
 | VideoReDoVersion
 | | '5.4.84.771 - Sep 24 2018'
 | Filename
 | | 'A:\files\dl\NASA -_Facts_And_Conse_RNgsN2JMI_Q.mp4'
 | Description
 | StreamType
 | | '4'
 | Duration
 | | '15356000000'
 | SyncAdjustment
 | | '0'
 | AudioVolumeAdjust
 | | '1.000000'
 | CutMode
 | | '1'
 | VideoStreamPID
 | | '513'
 | AudioStreamPID
 | | '514'
 | ProjectTime
 | | '12694000000'
 | CutList
 | | cut
 | | | Sequence
 | | | = 1
 | | | CutStart
 | | | = 00:02:54;18
 | | | CutEnd
 | | | = 00:07:20;24
 | | | Elapsed
 | | | = 00:02:54;18
 | | | CutTimeStart
 | | | | '1747200000'
 | | | CutTimeEnd
 | | | | '4409600000'
 | | | CutByteStart
 | | | | '5714975'
 | | | CutByteEnd
 | | | | '11913366'
 | SceneList
 | | SceneMarker
 | | | Sequence
 | | | = 1
 | | | Timecode
 | | | = 00:02:54;18
 | | | '1747200000'
 | | SceneMarker
 | | | Sequence
 | | | = 2
 | | | Timecode
 | | | = 00:07:20;24
 | | | '4409600000'
 | ChapterList
 | | ChapterMarker
 | | | Sequence
 | | | = 1
 | | | Timecode
 | | | = 00:00:00;00
 | | | '0'
 | | ChapterMarker
 | | | Sequence
 | | | = 2
 | | | Timecode
 | | | = 00:02:54;18
 | | | '1747200000'
======

And that's about it. I hope this helps someone with some parsing and scanning chores.

** Comments **

Please place any user comments here.

<<categories>> Concept | Parsing