flask - a mini flex/lex proc


ET 2021-06-07 - (3.14) The following little flex/lex proc was derived from a post on comp.lang.tcl by Christian Gollwitzer.

Intro to Flask

Flask is flex in a bottle. It is a function:

        Rules   +   Data -> Syntax Tree
Rules
These are similar to those in flex or lex, and depend on regular expressions.
Data
A text string, as might be read from a file.
Syntax Tree
The tree is a simple 2 level list of lists holding token types and the ranges of the matched text.

The function will scan through the data string and break it into sub-strings by using text matching rules. Each sub-string is given a type code from the rule label. The rules use Tcl's powerful regular expressions via [regexp], including capture groups when using a flask callback [proc]. The tree leaves are called tokens and consist of 3 items in a list.

Tokens
A 3 element list {type start end}

The type is the label from the matching rule. The start and end indices can be used directly with [string range $data $start $end] to extract the sub-strings.

The minimal structure of the tree allows for simple grouping or sectioning. In the simplest case, the tree reduces to a single ordered list of tokens to be used with a higher level parser.
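
For example, a minimal sketch of extracting one token's text, assuming $data holds the scanned string:

set token {Word 4 10}                       ;# a (hypothetical) token as returned by flask
lassign $token type start end               ;# the type label and the two indices
set text [string range $data $start $end]   ;# the matched sub-string out of $data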

Flask can be useful for a quick parse of simple files or those "little languages" that Brian Kernighan once wrote about. Flask is fully dynamic, since the rules are not compiled beforehand. Rules can invoke callbacks, which can extend flask by using the uplevel command. The examples section below presents some useful techniques.

What's in the Code Section

The top of the code section below contains 2 procs, flask and a display tool. After that are some original comments, a sample set of rules, and a data set (a small CAD STEP file). And finally, there are calls to flask and the display tool as a demonstration. The code block can be directly copy/pasted into a console window or a linux tclsh window to run the demo. For the best demo result, you will want a wide window of about 120 chars.

Status and Acknowledgments

Flask is (as of June 2021) just ~ 90 lines of pure Tcl code. It uses no global or namespace variables or packages. It's a testament to Tcl that this can be done with so little code. I hope it is the kind of smallish tool that RS would approve of. His code has been a treasure trove of techniques I have used here and elsewhere. And of course, thanks to Christian, who crafted the original version that I had so much fun working with.

Thanks for stopping by.

 User Guide

Flask

   Flask is a functional tool. It has 2 inputs and 1 output. It's based on the lex/flex tools
   and was derived from a post on comp.lang.tcl by Christian Gollwitzer.

Calling flask

   Flask takes 2 required arguments and 3 optional ones:
   
   flask     regextokens data {flush yes} {debug no} {indent 3}
   
1. regextokens 
   
   This is a list of 4N elements, arranged as a matrix of N rows and 4
   columns (no limit on rows). All 4 columns must be present in every row.
   
   The cells of the matrix are described below along with an example.
   
2. data
   
   This is a text string that represents the data to be parsed. If it came
   from a file, then it is simply all the text in the file as though read
   in a single [read $iochannel] statement.
   
3. flush 
   
   This is an optional argument, with a default value of true. If any tokens
   remain beyond the last eos token (the one that terminates the scan), they
   will be flushed into the result as a final section when this is true (the default).

4. debug

   If this is true, then for each token a line is printed via [puts] with info
   about the token type, regex, position in the input data, and a snippet of the
   data. To see the exact match, one can also use a puts action with ${$}.

5. indent

   How much to indent the debug lines; useful if callbacks output other data.

Output

   flask returns a list of sections. Each section is a list of 
   tokens. And each token is a list of 3 elements, an ID and 2 indices. So
   it's a 3 level structure. For example:

Returned Tokens - structure



  { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} }

    -token-- ---token-----       --token--- ----token-----
  ---------section----------   ----------section------------
---------------------------return---------------------------
   
   The Token ID is a string describing a token type; the 2 indices: start and end
   are used to extract the token from $data, using [string range $data $start $end]
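
   A sketch of walking this structure directly (the displaytokens proc in the Code
   section below does essentially this):

   foreach section $result {                          ;# level 1: sections
       foreach token $section {                       ;# level 2: tokens
           lassign $token type start end              ;# level 3: the token itself
           puts "$type: [string range $data $start $end]"
       }
   }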
    

Flask Rule matrix

   The regextokens matrix has 4 columns, as shown below.

       1          2                 3                     4
    tokenID     Regex             action              a comment
   
   Column 1 is a label to indicate which of the regular expressions in
   column 2 matched during the scan of the data. Column 3 is an action
   to take when a match occurs, and column 4 is a comment. 
   
   The comment column is required, but can be just an empty string. It's part of
   the matrix (ok, it's really a list) but is not used for anything. However, be
   aware that the usual rules for balancing braces etc. need to be considered.

Example Rule Matrix



    tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "an ID Token and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of section"
}

Flask processing algorithm

   
   The regular expressions will be tried one at a time from top to bottom starting at the
   first position in the input data text string. 

   When it finds a match, it looks up the actions for that RE pattern. After
   it performs the action and invokes any provided callback, it shifts the input
   pointer past the matched text and starts over at the first rule, looking for
   another match.

   This proceeds until there is no more data in the string OR no match is
   possible. If the last rule is simply a . then it can be used as a catchall rule,
   and any included callbacks will be executed. Often this is a call to an error routine.
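
   For example, a catchall rule in the style of the ERROR rules used in the examples further below:

    ERROR       {.{1,10}}         {skip {error "no rule matched at '${$}'"}}   "catchall: stop with context"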

   Note: all regular expressions will have \A prepended, so it's not needed in the rules,
         UNLESS the RE begins with embedded-option metasyntax (?xyz), which must be at
         the front of an RE; in that case the \A is inserted just after the metasyntax.
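
   For example (the second RE begins with embedded-option metasyntax):

      {foo.*bar}        is tried as    \Afoo.*bar
      {(?i)foo.*bar}    is tried as    (?i)\Afoo.*bar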

  
   The actions can be any one of these words:
   
   skip        - will match the regexp and move the scanning pointer past the token
   token       - will match the regexp and create a token to be found in the result
   eos         - this is the end of section token, but it is NOT output to the token list
   eos+token   - this is the same as eos, except a token WILL be output
   new         - this begins a new section (the current section is closed first)
   new+token   - this begins a new section THEN outputs the token

   When using the 2 new actions there can be an empty list element at the front of
   the result, which can be ignored using an lrange $result 1 end. This will happen if the rule is the first match.
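
   For example:

   set sections [lrange $result 1 end] ;# drop the empty first section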

   Any other text in the action field will be the same as a skip, which can facilitate
   commenting out the action.

   A good choice might be to put a / at the front. Note, however, that it is really just
   a skip, and if there is a callback, it WILL be called.

   Any rule may be commented out by having the first column, the tokenID, begin with a /,
   which means that the RE won't be tested; the whole rule is simply bypassed.
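
   For example, this disabled rule's RE is never tried:

    /WS         {[\s]+}           skip                "commented out: the whole rule is bypassed"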
   
   The action can also be a pair of items in a string list. The first must be one of the
   above actions, and the second is a callback item. Whatever text is matched can be
   accessed using ${$}. Here is an example action with a callback:

   {token {callback ${$} } }

   This will output a token for this rule and call the callback routine passing it the text
   that was matched.

Sample Flask Call

set result [flask $regextokens $data] ;# parse and return tokens into result

displaytokens $result $data           ;# debugging print out and example for traversing the result

 Code
# used to debug the output
proc displaytokens {tokens data {dump 1} {indent 0}} {
    set l 0 ;# section number
    set n 0 ;# count of tokens returned, if dump 0, only count em
    foreach line $tokens {
        incr l
        if { $dump } {
            if { $::tcl_platform(platform) eq "windows" } {
                 puts stderr "[string repeat " " $indent]$l $line"
            } else {
                 puts  "[string repeat " " $indent]$l $line"
            }
        }
        foreach token $line {
            lassign $token id from to
            if { $dump } {
                puts [string repeat " " $indent][format "     %-17s  ->   |%s|"  $token [string range $data $from $to]]
            }
            incr n
#            if { $n > 100 } { ;# use this to limit output for a large file
#               return
#            }
#            update
        }   
    }
    return $n
}

# description follows, along with an example grammar spec


proc flask {regextokens data {flush yes} {debug no} {indent 3}} { ;# indent is for debug in case of recursive calls
    
    # rpos is the running position where to read the next token
    set rpos 0
    
    set result {}
    set resultline {}
    set eos 0
    set newtokens [list]
    # copy the input rules and prefix a \A to the beginning of each r.e. unless metasyntax at the front, then we insert
    foreach {key RE actionlist comment} $regextokens {
        if {[regexp {\A(\(\?[bceimnpqstwx]+\))(.*)} $RE -> meta pattern]} {
            lappend newtokens $key "$meta\\A$pattern" $actionlist $comment ;# insert the \A after the meta
        } else {
            lappend newtokens $key "\\A$RE"  $actionlist $comment
        }
    }
    while true {
        set found false
        
        foreach {key RE actionlist comment} $newtokens {
            if { [string index $key 0] eq "/" } { ;# comments begin with /
                continue
            }
            if {[regexp -indices -start $rpos  $RE $data match  cap1 cap2 cap3 cap4 cap5 cap6 cap7 cap8 cap9]} {
                lassign $match start end
                if { $debug } { ;# map newlines/tabs to a unicode chars, use stderr to colorize the matched portion  (windows only)
                    set v1 [string range $data $rpos [expr {   $rpos+$end-$start     }]] 
                    set v2 [string range $data       [expr {   $rpos+$end-$start+1   }]   $rpos+40] 
                    regsub -all {\n} $v1 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v1 ;# or 21B2
                    regsub -all {\n} $v2 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v2 ;# or 21B2
                    regsub -all {\t} $v1 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 02eb] v1 ;# |- but 1 char width
                    regsub -all {\t} $v2 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 02eb] v2 ;# |- also
                    puts -nonewline  [format {%s%-10s %-40s (%4d %4d) |} [string repeat " " $indent] $key $RE $rpos $end]
                    if { $::tcl_platform(platform) eq "windows" } {
                        puts -nonewline stderr "\U2507$v1\U2507"
                    } else {
                        puts -nonewline "\U2507$v1\U2507"
                    }
                    puts "$v2|"
#                   update
                }
                set action [lindex $actionlist 0] ;# if a list, action first, then callback
                if { $action eq "token" } {
                    lappend resultline [list $key {*}$match]
                } elseif {$action eq "eos+token"} {
                    lappend resultline [list $key {*}$match]
                    set eos 1
                } elseif { $action eq "eos" } {
                    set eos 1
                } elseif { $action eq "new+token" } {
                    lappend result $resultline
                    set resultline [list]
                    lappend resultline [list $key {*}$match]
                } elseif { $action eq "new" } {
                    lappend result $resultline
                    set resultline [list]
                }

                if { [llength $actionlist] > 1 } {
                    set callback [lindex $actionlist 1]
                    set $ [string range $data $start $end]
                    eval $callback
                }
                set rpos [expr {$end+1}] ;# shift
                set found true
                break
            }
        }
        
        if {$found} {
            # minimal bottom up parsing
            # for Token designated as eos end line/section
            if {$eos} {
                lappend result $resultline
                set resultline {}
                set eos 0
#               puts "end of section"
            }
            
        } else {
            # nothing matched any longer
            if { $resultline ne {} && $flush} {
                lappend result $resultline
            }
#           puts "Parsing stopped"
            break
        }
    }
    
    return $result
}
# sample run:

#   tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "Token Id and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of section"
}

# sample data to parse, from a STEP file


set data {

ISO-10303-21;
HEADER;
FILE_DESCRIPTION (( 'STEP AP214' ),
    '1' );
FILE_NAME ('Airp'' lane_ {V1};$ {} [] ; \ ,"-".STEP',
    '2019-11-26T14:28:03',
    ( '' ),
    ( '' ),
    'SwSTEP 2.0',
    'SolidWorks 2010',
    '' );
FILE_SCHEMA (( 'AUTOMOTIVE_''DESIGN' ));
ENDSEC;

DATA;
#1 = CARTESIAN_POINT ( 'NONE',  ( -3397.537578589738600, -40.70728434983968900, -279.1044191236024400 ) ) ;
#2 = CARTESIAN_POINT ( 'NONE',  ( 3983.737298227797500, 1647.263135894628500, 772.3224850880964100 ) ) ;
#3 = CARTESIAN_POINT ( 'NONE',  ( -457.2417019049098600, 5240.945876103178300, 87.77828949283561100 ) ) ;
#4 = CARTESIAN_POINT ( 'NONE',  ( -1338.327255407125900, -7674.784143274568100, 415.3493082692564800 ) ) ;
ENDSEC;
END-ISO-10303-21; extra junk at end

}


# ---------- some minimal tests  ---
#set data {foobar;baz;} ;# 
#set data {foobar}
#set data {}

# ---------- load data file --------
#set io [open d:/smaller.step r]
#set data [read $io]
#close $io

# ---------- run data -------------

set result [flask $regextokens $data yes yes]
displaytokens $result $data

 Examples

Flask can do one level of sectioning. This results in a list of lists. Uses include parsing statements or lines. There are 2 ways to create a section.

  • A rule begins a new section (new and new+token) before the token is added
  • A rule ends the current section (eos and eos+token) after the token is added

If no new sections are ever created, then all tokens output will be in the first and only section. An lindex $result 0 is used to get that single list from flask's return tree. Then the program can do its own more detailed parse on the ordered token stream.
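
For example:

set tokens [lindex $result 0] ;# the single ordered token list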

Each matched sub-string can also have an action callback. The matched text can be accessed with variable ${$}. A program can also be designed to not output any tokens, but just use callbacks for all matches.

Example: a simple config format

The following trivial data file, a takeoff on the Windows INI configuration format, will be used with 3 styles of simple parsing and one style that uses the capturing groups of an RE. The data to parse is below and is stored in the variable data, which will be used in the examples that follow. It is up to the program to load it however it wishes; this is a feature of flask, as it separates the parsing/scanning from any I/O operations, something more difficult to do in tools like flex.

This file has sections starting at the square brackets, with individual name=value items following. Sections end with a ; and there can be comments with # at the beginning of a line. It's the sort of simple configuration a program might choose over more complex formats such as XML.

set data {

[Auto]
# comments here
Updates=1
News=1
;
[Request]
Updates=1
News=1
;
[epp]
Auto=1398
Version=812
File=\\\\tcl8.6.8\\generic\\tclStrToD.c
;

}
  • Style 1. A single linear list of tokens

This is the simplest way to use flask. It does not use sectioning and is a true lexical scanner. The output would be used by a parser in much the way yacc uses lex (bison and flex).

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ITEM was '${$}'"}}    "ITEM Token and a callback"
    SEMI        {;}                     skip                                    ""
} ;# this produces a single list of tokens

Comments, whitespace, and the semicolon are parsed but simply skipped. Running flask as shown below will do puts callbacks to output the matched text. The output using the displaytokens procedure follows.

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug

comment is  '# comments here'
the ITEM was 'Updates=1'
the ITEM was 'News=1'
the ITEM was 'Updates=1'
the ITEM was 'News=1'
the ITEM was 'Auto=1398'
the ITEM was 'Version=812'
the ITEM was 'File=\\\\tcl8.6.8\\generic\\tclStrToD.c'

%displaytokens $result $data

1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40} {SECT 44 52} {ITEM 54 62} {ITEM 64 69} {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

The 1 above is the section number; this single section contains 10 tokens.

% llength $result
1
% llength [lindex $result 0]
10
  • Style 2. Using an end of section (eos)

This method takes advantage of the semicolons in the data as end of section indicators. The only difference from the previous example is that the SEMI rule action is now eos. This causes flask to finish off the current sub-list at each semicolon and begin a new empty one.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ITEM was '${$}'"}}    "ITEM Token and a callback"
    SEMI        {;}                     eos                                     "semi colon for end"
} ;# this produces 2 levels, sections and lines, by using an end of section token, the SEMI

And the output:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data


1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
2 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
3 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|


% llength $result
3
% llength [lindex $result 0]
3
% llength [lindex $result 1]
3
% llength [lindex $result 2]
4
  • Style 3. Using new sections

This method would be used if there were no semicolons to indicate the end of a section, but rather some token that begins a new section. The new or new+token actions are for this purpose. When this method is used, there will always be one null section at the very beginning of the output. This can be removed using an lrange $result 1 end or just ignored. There are other ways to get around this, but they are likely not worth the trouble. See for example the use of state variables discussed in the advanced section.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}          " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {puts "Section was   '${$}'"}}   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token     {puts "   ITEM was '${$}'"}}     "ITEM Token and a callback"
    SEMI        {;}                     skip                                        "semi colon for end"
} ;# this also produces 2 levels, starting a new section on each SECT token, with an extra null section at the beginning

And the results:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data

1 
2 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
3 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
4 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

% llength $result
4
% llength [lindex $result 0]
0
% llength [lindex $result 1]
3
% llength [lindex $result 2]
3
% llength [lindex $result 3]
4

If the section starting tokens aren't required, say if they are simply constant text like "begin" or "data", one can use the action new instead of new+token so they are not included at the beginning of the sections.

  • Style 4. Using callbacks

The use of capture groups with callbacks is often easier than trying to write a rule for each of the pieces and have flask build a token for each match. It is most useful when you need to describe a rule that has several components that must all be present. In a sense, this provides a bit of higher level parsing from a lexical scanner.

  • Minimal parsing considered harmful

Flex does not seem to support capture groups. The Flex manual's FAQ actually frowns on having a scanner do something that looks a bit like parsing, saying one should use a true parser of the yacc class instead. That is always possible, since flask can be used as a simple tokenizer, with the parser using a single, non-sectioned token list.

  • Multiple alternatives and longest alternative match

With the | operator, capture groups can be used to indicate which alternative matched. In that case, Tcl REs will choose the longest match (or the first of equals, left to right). This can simulate how flex does its parallel matching. Flask has an extra level of control, since it can also prioritize by grouping alternatives into separate rules.

  • Capture groups and callbacks

The next example presents the grammar rules for our sample config file, using a callback to further parse the matched text and inject tokens into the token stream.

This example also demonstrates a method for handling syntax errors. If flask cannot find a match at some point in the file, it will return as though it had reached the end of the data. To handle that case, one can provide a catchall rule that matches anything and throws an error in the callback. The one here matches any characters and spits out up to 10 chars of context.

Notice the call to do_item in the ITEM rule. The callback below demonstrates using RE capture groups.


set regextokens {
    COMMENT     {#[^\n]*}               {skip {#puts "comment is  '${$}'"}}         " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {#puts "Section was  '${$}'"}}   "a section header"
    ITEM        {([a-zA-Z0-9]+)(\s*=)?([^\n]*)}   {skip   {do_item ${$}}}           "var=value to parse in callback"
    SEMI        {;}                     skip                                        "semi colon for end"
    ERROR       {.{1,10}}               {skip       {error "error at '${$}'"}}      "catch all is an error if we get here"
} ;# this also produces 2 levels, starting a new section on each SECT token, with an extra null section at the beginning

This method with capture groups makes it easier to split apart portions of a matched text. It uses the uplevel command to retrieve the capture information and then builds tokens to inject into the stream. See the advanced section for details on these cap* variables and others. In this method, all the rules except SECT use the skip action, and all other tokens are created from the callbacks.

proc do_item {item} { ;# proof of concept 1: to further parse the p=v using capture groups
 
         set cap1 [uplevel {set cap1}] ;# capture group 1 is the label
         set cap3 [uplevel {set cap3}] ;# capture group 3 is everything after the first =
#         puts "cap1= |$cap1| cap3= |$cap3| "
         
         uplevel lappend resultline [list [list parm  {*}$cap1]] ;# standard method for injecting a token
         uplevel lappend resultline [list [list value {*}$cap3]] ;# note the 2 uses of the list command
}

# This method uses capturing groups, and the code uses cap1 and cap3. The cap* variables have the
# right indices and so make this much easier. The number of cap groups has been increased to 9.
# The rule allows optional whitespace after a label, up to (but only 1) equal sign.

# Note: if a capture group captures a null string (say using a* when there were no a's), then
# regexp returns a start value at the point where the string would have started, with an end
# index 1 less than the start value. When doing a string range, this returns a null string,
# so all is good.

Here's what that looks like.

Notice the difference: in the earlier examples, without the cap* groups, the entire line was matched and produced only a single token. In that case the program using the tokens would have to do some string manipulation on the string or its own splitting. With this method, the tokens are easily traversed, two at a time for each pair (see the sketch after the output below). Either way would work; it's a matter of taste.

1 
2 {SECT 2 7} {parm 25 31} {value 33 33} {parm 35 38} {value 40 40}
          SECT 2 7                  ->          |[Auto]|
          parm 25 31                ->          |Updates|
          value 33 33               ->          |1|
          parm 35 38                ->          |News|
          value 40 40               ->          |1|
3 {SECT 44 52} {parm 55 61} {value 63 63} {parm 66 69} {value 71 71}
          SECT 44 52                ->          |[Request]|
          parm 55 61                ->          |Updates|
          value 63 63               ->          |1|
          parm 66 69                ->          |News|
          value 71 71               ->          |1|
4 {SECT 75 79} {parm 81 84} {value 86 89} {parm 91 97} {value 99 101} {parm 103 106} {value 108 142}
          SECT 75 79                ->          |[epp]|
          parm 81 84                ->          |Auto|
          value 86 89               ->          |1398|
          parm 91 97                ->          |Version|
          value 99 101              ->          |812|
          parm 103 106              ->          |File|
          value 108 142             ->          |\\\\tcl8.6.8\\generic\\tclStrToD.c|
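
Here is a hedged sketch of that two-at-a-time traversal, assuming $result and $data from this example:

foreach section [lrange $result 1 end] {            ;# skip the empty first section
    lassign [lindex $section 0] stype ss se         ;# each section leads with its SECT token
    puts [string range $data $ss $se]
    foreach {ptok vtok} [lrange $section 1 end] {   ;# then parm/value tokens, in pairs
        lassign $ptok ptype ps pe
        lassign $vtok vtype vs ve
        puts "  [string range $data $ps $pe] = [string range $data $vs $ve]"
    }
}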

Advanced topics and Callbacks

Callbacks and flask local variables.


Typically a callback will be used to invoke a procedure. When this proc is called, it is running 1 
level below flask and so with a very simple use of [uplevel], all the local variables of flask can be 
accessed from the callback. Here are some things you can do in a callback.

Get a list of all the local variables in flask from the callback

# ---------------
set info [uplevel {info local}]  
# ---------------

Retrieve the values of some and write them to the screen with puts

# ---------------
foreach v {rpos key RE actionlist comment match start end cap1 cap2 cap3 cap4}  { ;# retrieve most useful
        set val [uplevel "set $v"] ;# get each of the above variables values
        puts [format {        %-10s = %s} $v |$val|  ]
}
set data  [uplevel {set data}] ;# get the input data but don't list it, too big
# ---------------



To access individual variables, one can use a set statement like so 
to get a variable's value and giving them each the same name locally

# ---------------
set start  [uplevel {set start}] ;# get start and store in callback local of same name
set end    [uplevel {set end}]   ;# get end   and do the same
# ---------------

Here are the variables that can be accessed and their uses:

# --------------------------------------------------------------------------------
data          The string variable, unchanging, that was passed into flask
rpos          This is the running position (as an index) into data

key           These are the 4 columns of the current rule that has matched
RE
actionlist
comment

match         The current match, a pair of indices, a string range on this pair will access the matched data
start         The first index in match
end           The second index in match

cap1          Capture groups, up to 9, are found in these 9 variables; {-1 -1} for any not set
cap2
cap3
cap4
...
cap9

The above are mostly for retrieving information about the current match, while the following are used
by flask to build the token lists. These are used to inject tokens into the stream from a callback:

resultline    This is the current section's list of tokens (originally a section was a line)
result        This is the list of lists; at each section end, resultline is added to result and then cleared

# --------------------------------------------------------------------------------

When there are capturing groups in the regex (portions of the RE in ()'s), the indices of each capture are
found in cap1..cap9. Groups that did not participate in the match are assigned the values -1 -1.

A callback can add tokens to the list that is being built up. A token is always a 3 element list
comprising a {type start end} triad. The 2 indices in the cap* variables are in a list
of their own as a {start end} pair. To build a new token from a pair of indices and give it the
type "mytoken", one could do this with the first capture group:


# ---------------  
# note the need here for [list [list ...]] this is because uplevel will use concat which undoes the outer one         
uplevel lappend resultline [list [list mytoken {*}$cap1]]  ;# add a token {mytoken cap-start cap-end}
# ---------------

Note that to be useful, the indices should be relative to the text found in the data variable.
If one does some parsing of the matched string, say with a [string first] or [regexp] statement
that returns indices, then one would typically need to add the offset found in the start
variable when using them to build a token.
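
As a hedged sketch (the EQ token type and the do_eq proc are made up for illustration), here is a
callback, invoked as {skip {do_eq ${$}}}, that finds an = inside the matched text and injects a
token with data-relative indices:

proc do_eq {txt} {
    set start [uplevel {set start}]      ;# where this match begins, an index into data
    set i [string first "=" $txt]        ;# index relative to the matched text only
    if { $i >= 0 } {
        set abs [expr {$start + $i}]     ;# add the offset to make it absolute
        uplevel lappend resultline [list [list EQ $abs $abs]] ;# inject the token {EQ abs abs}
    }
}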

After a match, and AFTER the callback is made, the variable rpos is updated to point to
the next position in the text (in data). This is done as follows:

set rpos [expr {$end+1}] ;# shift

So, it is possible to modify rpos indirectly from the callback, knowing it will be
updated (immediately) after the return is made from the callback. This can be accomplished
by modifying the variable "end" before returning from the callback.
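
For example, this one-liner in a callback backs the scanner up by two characters, so they will be
rescanned by the next match:

uplevel {incr end -2} ;# rpos becomes end+1 right after the callback returns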

# ---------------
Starting a new section
# ---------------

uplevel {lappend result $resultline} ;# append the current section list to the result
uplevel {set   resultline {}}        ;# clear the current section

Note, this could result in an empty section depending on the length of resultline.

# ---------------
Rejecting a match 
# ---------------

A callback can also return with a [return -code continue] which will cause rpos to not
be updated, and the match to be rejected, so the foreach loop that is iterating over
the REs will continue instead of starting over at the top. 

For example, suppose one has this rule at the top

    test        {^test(.*?)\n}    {skip {do_test ${$}}}         "testing a continue"

and do_test is this:

proc do_test {arg} {
        set cap1 [uplevel {set cap1}] ;# retrieve first capture group
        set start [lindex $cap1 0]
        set end   [lindex $cap1 1]
        puts "found test with |$arg| and cap = |$cap1|"
        if { ($end - $start + 1) > 5} {
             return -code continue
        }
        return
}

In this example, the callback retrieves the current match's capture group 1 indices (start end)
and checks for a length greater than 5; if so, it rejects the match.


# ---------------
Saving state
# ---------------

While flask does not directly support flex/lex start conditions, these can be implemented by
saving some state inside flask (or, if you don't like that, by using global variables etc.)

proc do_count {args} {
    set counter [uplevel {incr counter}] ;# will set it to 1 if it does not exist yet
    if { $counter <= 1 } {
        puts stderr "what to do first time only  = $counter"
    } else {
        puts stderr "what to do 2..nth time here = $counter"
    }
}

The above could be used in a callback to detect if this rule has been matched before
and perhaps ignore it or do something different.

One could also optionally do a [return -code continue] to let another rule take a shot
at it. There's no end of possibilities here, but you must be sure you are not modifying
a flask variable unintentionally. Tcl presents many interesting capabilities here that
few other languages have (and none others that I know of). Have fun with this!

And of course, all of this assumes that flask will not be modified, so all bets are off if
one does that. On the other hand, it's a small proc and the source code is provided.

 

Debugging

Flask has a tracing feature that is useful for debugging your rules. It's turned on with the 4th parameter to the flask proc. For example,

set result [flask $rules $data yes yes 3]

When calling flask, the final 3 parameters are optional. Setting the 4th or 5th also requires a value for the 3rd. The 5th parameter provides a way to indent the debug output, to make it easier to distinguish from any other output that might come from callbacks. It defaults to 3. The included displaytokens also has an indent parameter.

Here is a sample display of the output:

   WS         \A[\s]+                                  (   0    1) |┊⤶⤶┊[Auto]⤶# comments here that are very lo|
   SECT       \A\[[a-zA-Z]+\]                          (   2    7) |┊[Auto]┊⤶# comments here that are very long|
   WS         \A[\s]+                                  (   8    8) |┊⤶┊# comments here that are very long indee|
   COMMENT    \A#[^\n]*                                (   9   72) |┊# comments here that are very long indeed will go past the limit┊|
comment is  '# comments here that are very long indeed will go past the limit'
   WS         \A[\s]+                                  (  73   73) |┊⤶┊Updates=2⤶News=2⤶;⤶[Request]⤶ Updates=1⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  74   82) |┊Updates=2┊⤶News=2⤶;⤶[Request]⤶ Updates=1⤶ |
   WS         \A[\s]+                                  (  83   83) |┊⤶┊News=2⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  84   89) |┊News=2┊⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[|
   WS         \A[\s]+                                  (  90   90) |┊⤶┊;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[epp]⤶A|

The first column is the token code, then the regular expression (notice the \A has been added automatically), followed by the start and end indices of the matched data.

Next, inside vertical bars, is the text where the rule started its match; the match itself is additionally enclosed in a pair of unicode quadruple-dash vertical bars. With the windows console (or when enabled on linux, using the wiki code here [L1 ]), those 2 bars plus the matched text will be colored red. All newlines/tabs are mapped to unicode chars for better visibility of the text.

All rules will be output, even if they don't produce tokens. And any callbacks that output text will be interspersed as well.

Of course since you have the source code, you can just find the [format] statements and change the column sizes if you want some other values.

Extra - longest match RE with alternatives

The tools flex and lex do pattern matching amongst the matching rules in parallel. If multiple rules can match at some point in the data, then the longest one is chosen (or the first of equal sized matches). Flask is different: it tries its rules one at a time, and the first match wins, not the longest.

However, Tcl REs that use the alternation operator | will choose the longest match. So, with a callback, flask can produce the same results, although likely not as fast.

Below we have some data, some rules, and a callback. Notice that the action is skip for both rules, so only tokens inserted from the callback will appear in the result. Since the RE is a set of 4 alternatives, each in ()'s, regexp will assign a value (2 indices in a list) to each of the 4 variables cap1..cap4. All but 1 will be {-1 -1}; the other one is the longest match.

The callback retrieves cap1..cap4 into a local array variable cap() and stops at the one that is not {-1 -1}. It then creates a corresponding token type (type1..type4) depending on which alternative matched, and injects a token of that type, with the capture indices, into the token stream.

The example outputs some debug information, including a dump of the cap array using parray.

# (flask and displaytokens proc's inserted here, or paste the whole code section above first)

# -------- sample data ------

set data {
 aaaaax
xyaaz
 abbbbb
}

# -------- flask rules ------

set regextokens {
    test     {(a+)|\A(ab*)|\A(aa)|\A(a+x)}   {skip {do_longest ${$}}}   "test for longest of 4 alternatives "
    skippy   {.}                              skip                      "catch all anything else and toss it"
} 

# --------  callback -------

proc do_longest {mat} {
    puts "match= |$mat| "
    foreach c {cap1 cap2 cap3 cap4} {
        incr group      ;# 1..4
        set cap($group) [uplevel set $c]
        if { $cap($group) ne {-1 -1} } {
            puts "longest @ $group = $cap($group)"
            break 
        }
    }
    parray cap
    set type     type$group   ;# token of type type# 
    set indices $cap($group)  ;# pair of indices
    puts "type= |$type| indices= |$indices| "
    
    uplevel lappend resultline [list [list $type {*}$indices]]      ;# insert a token for the longest
}

# now flask it and dump the token list
set result [flask $regextokens $data yes yes 30]
    puts "\n---------------------------"
displaytokens $result $data

Here is the output of this program. Note the use of the indent for the debug (30), to make it easier to separate the debug output from the callback output.

                              skippy     \A.                                      (   0    0) |┊⤶┊ aaaaax⤶xyaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (   1    1) |┊ ┊aaaaax⤶xyaaz⤶ abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (   2    7) |┊aaaaax┊⤶xyaaz⤶ abbbbb⤶|
match= |aaaaax| 
longest @ 4 = 2 7
cap(1) = -1 -1
cap(2) = -1 -1
cap(3) = -1 -1
cap(4) = 2 7
type= |type4| indices= |2 7| 
                              skippy     \A.                                      (   8    8) |┊⤶┊xyaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (   9    9) |┊x┊yaaz⤶ abbbbb⤶|
                              skippy     \A.                                      (  10   10) |┊y┊aaz⤶ abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (  11   12) |┊aa┊z⤶ abbbbb⤶|
match= |aa| 
longest @ 1 = 11 12
cap(1) = 11 12
type= |type1| indices= |11 12| 
                              skippy     \A.                                      (  13   13) |┊z┊⤶ abbbbb⤶|
                              skippy     \A.                                      (  14   14) |┊⤶┊ abbbbb⤶|
                              skippy     \A.                                      (  15   15) |┊ ┊abbbbb⤶|
                              test       \A(a+)|\A(ab*)|\A(aa)|\A(a+x)            (  16   21) |┊abbbbb┊⤶|
match= |abbbbb| 
longest @ 2 = 16 21
cap(1) = -1 -1
cap(2) = 16 21
type= |type2| indices= |16 21| 
                              skippy     \A.                                      (  22   22) |┊⤶┊|

---------------------------
1 {type4 2 7} {type1 11 12} {type2 16 21}
          type4 2 7                 ->          |aaaaax|
          type1 11 12               ->          |aa|
          type2 16 21               ->          |abbbbb|

Extra - tokenize xml + json

Here we have a trivial XML file. When we find the ?xml tag, we look at the rest of it and parse it with a recursive call to flask. Of course, using flask recursively is here only to show how it can be done; this example could be done with just a few lines of Tcl, using some regexp calls and the capture groups.

When we get the results from the "inner" flask, we use a relative-to-absolute index function that also prepends --- to the front of the token type: a) for better visibility in this demo, and b) because the prefix can be handy for a program walking the list, to tell it is past the parameters without needing to count them first in some other way.

set data {

<?xml version = "1.0" encoding = "UTF-8" ?>

<!--Students grades are uploaded by months-->
<class_list>
   <student>
      <name>Tanmay</name>
      <grade>A</grade>
   </student>
</class_list>


}

set rules {
    WS          {\s+}                               skip                                    "whitespace"
    comment     {<!--.*?-->}                        /token                                  "comments, for debugging useful to token it"
    tag         {(?i)</?[a-z?][a-z0-9]*[^<>]*>}     {token {tagback ${$} }}                 "any kind of tag, case insensitive"
    -data       {[^\<\>]+}                          token                                   "any stuff between tags"
    error       {.{1,50}}                           {skip {update;error "error at ${$}"} }  "w/o update, might not see debug data"
}

proc rel2abs {rlist start} {    ;# map a relative list of tokens -> absolute list by adding start
    set toks [lindex $rlist 0]  ;# get the list in the first and only section
    set newlist {}
    foreach tok $toks {         ;#convert each token
        lassign $tok type from to
        lappend newlist [list ---$type [expr {  $from + $start  }]  [expr { $to + $start }]   ] ;#  to absolute add --- for visibility
    }
    return $newlist
}

proc tagback2 {arg} {
    set cap1 [uplevel set cap1]
    set cap2 [uplevel set cap2]

    uplevel lappend resultline [list [list parm  {*}$cap1]] ;# standard method for injecting a token
    uplevel lappend resultline [list [list value {*}$cap2]] ;# note the 2 uses of the list command
    
}

proc tagback {arg} {                            ;# found a tag, see if it's one with extra parameters

    if { [string range $arg 0 4] eq "<?xml" } {     ;# special tags with extra data to parse recursively
        set offset 5                            
    } else {
        return 0                                ;# not one with extra data to bust up
    }

    set start [uplevel set start]               ;# where this match starts 
    incr start $offset                          ;# for converting relative to absolute string positions
    
    set data [string range $arg $offset end-1]  ;# strip the <... and the > at the end, don't need em
    set rules {
        ws      {\s+}                           skip                        "The ID below uses 2 capture groups"
        ID      {([\w]+)\s*=\s*"([^\"]*?)"}     {skip   {tagback2 ${$} } }  {foo = "bar" foo->cap1 bar->cap2}
    }
    set result [flask $rules $data yes no 20]   ;# parse the list of values recursively no debug

    foreach atok [rel2abs $result $start] {     ;# add each separately after adding start to convert to abs
        uplevel lappend resultline [list [list {*}$atok]] ;# note the 2 uses of the list command on the entire token
    }

    return 1                                    ;# indicate we found something to further parse
}
#  set result [flask $rules $data]
#  puts [displaytokens $result $data] ;# returns the number of tokens in total

with this result:

1 {tag 2 44} {---parm 8 14} {---value 19 21} {---parm 24 31} {---value 36 40} {tag 93 104} {tag 109 117} {tag 125 130} {-data 131 136} {tag 137 143} {tag 151 157} {-data 158 158} {tag 159 166} {tag 171 180} {tag 182 194}
     tag 2 44           ->   |<?xml version = "1.0" encoding = "UTF-8" ?>|
     ---parm 8 14       ->   |version|
     ---value 19 21     ->   |1.0|
     ---parm 24 31      ->   |encoding|
     ---value 36 40     ->   |UTF-8|
     tag 93 104         ->   |<class_list>|
     tag 109 117        ->   |<student>|
     tag 125 130        ->   |<name>|
     -data 131 136      ->   |Tanmay|
     tag 137 143        ->   |</name>|
     tag 151 157        ->   |<grade>|
     -data 158 158      ->   |A|
     tag 159 166        ->   |</grade>|
     tag 171 180        ->   |</student>|
     tag 182 194        ->   |</class_list>|
15

Example JSON. Here we have just a single section, so it's a simple token stream. It could then be used by a higher level parser, for example a simple recursive descent parser. Note that 2 types of strings are allowed here. This also allows for doubled up quotes inside the quotes; not sure that's true JSON, but that's what this does. Notice how the String rule appears twice. This could probably be done with a more complex RE in one rule, but this seems simpler. The \" is the same as just " but helps my editor's syntax coloring; it is not needed.

 
set data {
"class_list":{"student":{
        "name":"Tammy",
        "age": 19,
        'grade':"A",
        "phoneNumbers": [
        { "type": "home", "number": "7349282382" }
        ]
    }}
}

set rules {
    WS          {\s+}                               skip                  "whitespace"
    String      {'(?:''|[^'])*'}                    token                 "a string in single quotes"
    String      {"(?:""|[^\"])*"}                   token                 "a string in double quotes"
    Number      {[-+]?[0-9]+(\.[0-9]+)?}            token                 "integer or a floating point number with mandatory integer part. sign optional"
    
    Lcurly      {\{}                                token                 "left brace"
    Rcurly      {\}}                                token                 "Right brace"
    Lsquare     {\[}                                token                 "left square bracket"
    Rsquare     {\]}                                token                 "Right square bracket"
    Colon       {:}                                 token                 "colon"
    Comma       {,}                                 token                 "comma"
    error       {.{1,50}}                           {skip {update;error "error at ${$}\n\n"} }  "w/o update, might not see debug data"
}

 set result [flask $rules $data]
 displaytokens $result $data 1  

and the output:

1 {String 1 12} {Colon 13 13} {Lcurly 14 14} {String 15 23} {Colon 24 24} {Lcurly 25 25} {String 29 34} {Colon 35 35} ...
     String 1 12        ->   |"class_list"|
     Colon 13 13        ->   |:|
     Lcurly 14 14       ->   |{|
     String 15 23       ->   |"student"|
     Colon 24 24        ->   |:|
     Lcurly 25 25       ->   |{|
     String 29 34       ->   |"name"|
     Colon 35 35        ->   |:|
     String 36 42       ->   |"Tammy"|
     Comma 43 43        ->   |,|
     String 47 51       ->   |"age"|
     Colon 52 52        ->   |:|
     Number 54 55       ->   |19|
     Comma 56 56        ->   |,|
     String 60 66       ->   |'grade'|
     Colon 67 67        ->   |:|
     String 68 70       ->   |"A"|
     Comma 71 71        ->   |,|
     String 75 88       ->   |"phoneNumbers"|
     Colon 89 89        ->   |:|
     Lsquare 91 91      ->   |[|
     Lcurly 95 95       ->   |{|
     String 97 102      ->   |"type"|
     Colon 103 103      ->   |:|
     String 105 110     ->   |"home"|
     Comma 111 111      ->   |,|
     String 113 120     ->   |"number"|
     Colon 121 121      ->   |:|
     String 123 134     ->   |"7349282382"|
     Rcurly 136 136     ->   |}|
     Rsquare 140 140    ->   |]|
     Rcurly 143 143     ->   |}|
     Rcurly 144 144     ->   |}|
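
As a hedged sketch of how a higher level parser might consume this stream, here is a minimal
token cursor (peek and advance are hypothetical helpers, not part of flask):

set tokens [lindex $result 0]    ;# the single section of JSON tokens
set pos 0

proc peek {} {                   ;# look at the current token without consuming it
    global tokens pos
    return [lindex $tokens $pos]
}

proc advance {} {                ;# consume and return the current token
    global tokens pos
    set tok [lindex $tokens $pos]
    incr pos
    return $tok
}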

The following is a slightly larger version of the above XML code and rules, with additional processing for several tags that contain extra information. It operates on an actual XML file that my video editor (VideoReDo) generates as a project file. The file has several sections, for cutlists, scenes, and chapters. The final tag could also be used to output 1 more section, but that rule was commented out to give an example of doing so.

proc tagback {arg} {                            ;# found a tag, see if it's one with extra parameters

    if { [string range $arg 0 4] eq "<?xml" } {     ;# special tags with extra data to parse recursively
        set offset 5                            
    } elseif { [string range $arg 0 11] eq "<SceneMarker" } { 
        set offset 12
    } elseif { [string range $arg 0 13] eq "<ChapterMarker" } {
        set offset 14
    } elseif { [string range $arg 0 3] eq "<cut" } {
        set offset 4                            ;# skip the <cut and we'll skip any whitespace here with a rule
    } else {
        return 0                                ;# not one with extra data to bust up
    }

    set start [uplevel set start]               ;# where this match starts 
    incr start $offset                          ;# for converting relative to absolute string positions
    
    set data [string range $arg $offset end-1]  ;# strip the <... and the > at the end, don't need em
    set rules {
        ws      {\s+}                           skip                        "The ID below uses 2 capture groups"
        ID      {([\w]+)\s*=\s*\"([^\"]*?)"}   {token {tagback2 ${$} } }  {foo = "bar" foo->cap1 bar->cap2}
    }
    set result [flask $rules $data yes no 20]   ;# parse the list of values recursively
#   puts "result= |$result| "
    foreach atok [rel2abs $result $start] {     ;# add each separately after adding start to convert to abs
        uplevel lappend resultline [list [list {*}$atok]] ;# note the 2 uses of the list command on the entire token
    }
#   displaytokens $result $data 1 30            ;# for debug, we now have an indent and a token counter


    return 1                                    ;# indicate we found something to further parse
}


set data {

<!-- The following is a project file
     and this is an XML comment block
-->

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><VideoReDoProject Version="5">
 <VideoReDoVersion BuildNumber="771">5.4.84.771 - Sep 24 2018</VideoReDoVersion>
<Filename>A:\files\dl\NASA  -_Facts_And_Conse_RNgsN2JMI_Q.mp4</Filename>
<Description></Description><StreamType>4</StreamType><Duration>15356000000</Duration>
<SyncAdjustment>0</SyncAdjustment><AudioVolumeAdjust>1.000000</AudioVolumeAdjust>
<CutMode>1</CutMode><VideoStreamPID>513</VideoStreamPID><AudioStreamPID>514</AudioStreamPID>
<ProjectTime>12694000000</ProjectTime>
<CutList> 
<cut  Sequence="1" CutStart="00:02:54;18" CutEnd="00:07:20;24" Elapsed="00:02:54;18"> 
<CutTimeStart>1747200000</CutTimeStart><CutTimeEnd>4409600000</CutTimeEnd>
<CutByteStart>5714975</CutByteStart><CutByteEnd>11913366</CutByteEnd>
</cut>
</CutList> 
<SceneList>    <SceneMarker Sequence="1" Timecode="00:02:54;18">1747200000</SceneMarker> 
<SceneMarker Sequence="2" Timecode="00:07:20;24">4409600000</SceneMarker>
</SceneList>
<ChapterList>   
<ChapterMarker Sequence="1" Timecode="00:00:00;00">0</ChapterMarker>
<ChapterMarker Sequence="2" Timecode="00:02:54;18">1747200000</ChapterMarker>
</ChapterList></VideoReDoProject>

}

set rules {
    WS          {\s+}                               skip                                    "whitespace"
    comment     {<!--.*?-->}                        /token                                  "comments, for debugging useful to token it"
    
    /tag        {</VideoReDoProject>}           new+token                               "Special tags start a new section"
    tag         {<ChapterList>}                 new+token                               ""
    tag         {<CutList>}                     new+token                               ""
    tag         {<SceneList>}                   new+token                               ""
    
    tag         {(?i)</?[a-z?][a-z0-9]*[^<>]*>}     {token {tagback ${$} }}                 "any kind of tag, case insensitive"
    -data       {[^\<\>]+}                          token                                   "any stuff between tags"
    error       {.{1,50}}                           {skip {update;error "error at ${$}"} }  "w/o update, might not see debug data"
}


Here is only part of the output, since it's rather large. What you can see is that it added some sectioning which divides the cut lists from the scene lists. The commented out tag rule would have added one more, and is here demonstrating the comment / character on the rule id name.

Also, the comment rule has its action commented out, but the rule is still processed; a commented out action is the same as a skip. Note, however, that had there been a callback, it would have been executed. To comment out a callback one would use the standard # of Tcl, since the callback is a Tcl statement. The time to parse this data was 2.5 ms for a total of 97 tokens generated.

Notice that multiple rules create the same tag token. This lets us do some extra actions (create some sections) but still return a tag token. However, we must match the entire tag, or there would be something left over and it would not work correctly.

time-2,525.000 microseconds per iteration
1 {tag 42 96} {---ID 48 60} {---parm 48 54} {---value 57 59} {---ID 62 77} {---parm 62 69} {---value 72 76} {---ID 79 94} {---parm 79 88} {---value 91 93} {tag 97 126} {tag 129 164} {-data 165 188} {tag 189 207} {tag 209 218} {-data 219 269} {tag 270 280} {tag 282 294} {tag 295 308} {tag 309 320} {-data 321 321} {tag 322 334} {tag 335 344} {-data 345 355} {tag 356 366} {tag 368 383} {-data 384 384} {tag 385 401} {tag 402 420} {-data 421 428} {tag 429 448} {tag 450 458} {-data 459 459} {tag 460 469} {tag 470 485} {-data 486 488} {tag 489 505} {tag 506 521} {-data 522 524} {tag 525 541} {tag 543 555} {-data 556 566} {tag 567 580}
     tag 42 96          ->   |<?xml version="1.0" encoding="UTF-8" standalone="yes"?>|
     ---ID 48 60        ->   |version="1.0"|
     ---parm 48 54      ->   |version|
     ---value 57 59     ->   |1.0|
     ---ID 62 77        ->   |encoding="UTF-8"|
     ---parm 62 69      ->   |encoding|
     ---value 72 76     ->   |UTF-8|
     ---ID 79 94        ->   |standalone="yes"|
     ---parm 79 88      ->   |standalone|
     ---value 91 93     ->   |yes|
     tag 97 126         ->   |<VideoReDoProject Version="5">|
     tag 129 164        ->   |<VideoReDoVersion BuildNumber="771">|
     -data 165 188      ->   |5.4.84.771 - Sep 24 2018|
     tag 189 207        ->   |</VideoReDoVersion>|

... snip ...

     tag 812 824        ->   |</CutByteEnd>|
     tag 826 831        ->   |</cut>|
     tag 833 842        ->   |</CutList>|
3 {tag 846 856} {tag 861 909} {---ID 874 885} {---parm 874 881} {---value 884 884} {---ID 887 908} {---parm 887 894} {---value 897 907} {-data 910 919} {tag 920 933} {tag 936 984} {---ID 949 960} {---parm 949 956} {---value 959 959} {---ID 962 983} {---parm 962 969} {---value 972 982} {-data 985 994} {tag 995 1008} {tag 1010 1021}
     tag 846 856        ->   |<SceneList>|
     tag 861 909        ->   |<SceneMarker Sequence="1" Timecode="00:02:54;18">|
     ---ID 874 885      ->   |Sequence="1"|
     ---parm 874 881    ->   |Sequence|
     ---value 884 884   ->   |1|
     ---ID 887 908      ->   |Timecode="00:02:54;18"|
     ---parm 887 894    ->   |Timecode|
     ---value 897 907   ->   |00:02:54;18|
     -data 910 919      ->   |1747200000|
     tag 920 933        ->   |</SceneMarker>|
     tag 936 984        ->   |<SceneMarker Sequence="2" Timecode="00:07:20;24">|
     ---ID 949 960      ->   |Sequence="2"|
     ---parm 949 956    ->   |Sequence|
     ---value 959 959   ->   |2|
     ---ID 962 983      ->   |Timecode="00:07:20;24"|
     ---parm 962 969    ->   |Timecode|
     ---value 972 982   ->   |00:07:20;24|
     -data 985 994      ->   |4409600000|
     tag 995 1008       ->   |</SceneMarker>|
     tag 1010 1021      ->   |</SceneList>|
4 {tag 1023 1035} {tag 1040 1090} {---ID 1055 1066} {---parm 1055 1062} {---value 1065 1065} {---ID 1068 1089} {---parm 1068 1075} {---value 1078 1088} {-data 1091 1091} {tag 1092 1107} {tag 1109 1159} {---ID 1124 1135} {---parm 1124 1131} {---value 1134 1134} {---ID 1137 1158} {---parm 1137 1144} {---value 1147 1157} {-data 1160 1169} {tag 1170 1185} {tag 1187 1200} {tag 1201 1219}
     tag 1023 1035      ->   |<ChapterList>|
     tag 1040 1090      ->   |<ChapterMarker Sequence="1" Timecode="00:00:00;00">|
     ---ID 1055 1066    ->   |Sequence="1"|
     ---parm 1055 1062  ->   |Sequence|

... snip ...

Extra - verify, pretty print XML + JSON -> XML

Now that we have our token list, what can we do with it?

Here's some code to walk the token lists and check that every opening tag is matched by a closing tag.

# some post processing 
# first collect all the tag tokens into a single list, tags, for convenience

set tags {}
foreach section $result {
    foreach tok $section {
        if { [lindex $tok 0] eq "tag" } { ;# only grab the tokens with type tag
            lappend tags $tok
        }   
    }   
}


set tags [lrange $tags 1 end] ;# trim off the <?xml?> token; it has no matching close tag

# now use a stack with the linear list of just the tags tokens

set badxml no
set stack {anchor}
foreach tag $tags {
    lassign $tag type start end
    set tagkey [lindex [split [string range $data $start+1 $end-1] " "] 0]  ;# split on space, -> list of 1 or more
    if { [string index $tagkey 0] eq "/" } {                                ;# so we can check if it's a closing tag
        if { [string range $tagkey 1 end] ne [lindex $stack end] } {        ;# compare top of stack with our closing
            puts stderr "mismatch with $tagkey : [lindex $stack end]"       ;# tag with its / removed
            set badxml yes
            break
        }
        set stack [lrange $stack 0 end-1]       ;# pop 
    } elseif {[string index $tagkey end] eq "/"} { ;# ends with a /, so it's an empty tag: both a push and a pop, do nothing
    } else {
        lappend stack $tagkey                   ;# push
    }
}
if { [llength $stack] != 1 || $stack ne {anchor} || $badxml} {
    puts stderr "Tags malformed"
} else {
    puts "Tags balance" 
}
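
To see the failure path, feed a mismatched tag sequence through the same stack loop. A hedged mini-demo (baddata and badtags are made-up names, not from the original, so the $data/$tags of the VideoReDo example are not clobbered):

# hypothetical input <a><b></a> : three tag tokens at 0-2, 3-5 and 6-9
set baddata {<a><b></a>}
set badtags {{tag 0 2} {tag 3 5} {tag 6 9}}
# run the foreach loop above reading baddata/badtags (skip the lrange
# trim, since this toy input has no <?xml?> token): when </a> arrives
# the stack holds {anchor a b}, so the loop prints to stderr
#   mismatch with /a : b
# and the final test reports: Tags malformed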

Next is an example that pretty prints the VideoReDo XML using Unicode box-drawing characters.

set stack {}
set vline1 "\U2502" ;# a nice vertical line
set vline2 "\U2515" ;# vertical line with a foot; since we have validated, no need to muddy the output with the end tag itself
foreach section $result {
    foreach tok $section {
        lassign $tok type start end
        set indent [string repeat " $vline1  " [expr {    [llength $stack] - 1   }] ] 
        if { $type eq "tag" } {
            set tagkey [lindex [split [string range $data $start $end] " "] 0] ;# split on space, -> list of 1 or more
            if { [string index $tagkey end] eq ">" } {
                set tagkey [string range $tagkey 0 end-1] ;# strip trailing > on tags with no attributes
            }
            set tagkey [string range $tagkey 1 end]
            if { [string index $tagkey 0] eq "/" } {
                set stack [lrange $stack 0 end-1]       ;# pop 
                set indent [string repeat " $vline1  " [expr {    [llength $stack] - 1   }] ] 
                puts "$indent $vline2" ;# optionally output the /end tag if desired, but compute indent after stack pop
            } elseif {[string index $tagkey end] eq "/"} {
                puts "$indent $tagkey"
            } else {
                puts "$indent $tagkey"
                lappend stack $tagkey                   ;# push
            }
        } elseif { $type eq "---parm"} {
            set parmtext [string range $data {*}[lrange $tok 1 2]] 
            puts "$indent $parmtext"
        } elseif {$type eq  "---value" } {
            set valutext [string range $data {*}[lrange $tok 1 2]] 
            puts  "$indent      = $valutext"
        } elseif {$type eq "-data" } {
            set datatext [string range $data {*}[lrange $tok 1 2]] 
            puts  "$indent '$datatext'"
        } else {
            puts stderr "should not happen"
        }
    }   
}

outputting:

 ?xml
 version
      = 1.0
 encoding
      = UTF-8
 standalone
      = yes
 VideoReDoProject
 │   VideoReDoVersion
 │   │   '5.4.84.771 - Sep 24 2018'
 │   ┕
 │   Filename
 │   │   'A:\files\dl\NASA  -_Facts_And_Conse_RNgsN2JMI_Q.mp4'
 │   ┕
 │   Description
 │   ┕
 │   StreamType
 │   │   '4'
 │   ┕
 │   Duration
 │   │   '15356000000'
 │   ┕
 │   SyncAdjustment
 │   │   '0'
 │   ┕
 │   AudioVolumeAdjust
 │   │   '1.000000'
 │   ┕
 │   CutMode
 │   │   '1'
 │   ┕
 │   VideoStreamPID
 │   │   '513'
 │   ┕
 │   AudioStreamPID
 │   │   '514'
 │   ┕
 │   ProjectTime
 │   │   '12694000000'
 │   ┕
 │   CutList
 │   │   cut
 │   │   │   Sequence
 │   │   │        = 1
 │   │   │   CutStart
 │   │   │        = 00:02:54;18
 │   │   │   CutEnd
 │   │   │        = 00:07:20;24
 │   │   │   Elapsed
 │   │   │        = 00:02:54;18
 │   │   │   CutTimeStart
 │   │   │   │   '1747200000'
 │   │   │   ┕
 │   │   │   CutTimeEnd
 │   │   │   │   '4409600000'
 │   │   │   ┕
 │   │   │   CutByteStart
 │   │   │   │   '5714975'
 │   │   │   ┕
 │   │   │   CutByteEnd
 │   │   │   │   '11913366'
 │   │   │   ┕
 │   │   ┕
 │   ┕
 │   SceneList
 │   │   SceneMarker
 │   │   │   Sequence
 │   │   │        = 1
 │   │   │   Timecode
 │   │   │        = 00:02:54;18
 │   │   │   '1747200000'
 │   │   ┕
 │   │   SceneMarker
 │   │   │   Sequence
 │   │   │        = 2
 │   │   │   Timecode
 │   │   │        = 00:07:20;24
 │   │   │   '4409600000'
 │   │   ┕
 │   ┕
 │   ChapterList
 │   │   ChapterMarker
 │   │   │   Sequence
 │   │   │        = 1
 │   │   │   Timecode
 │   │   │        = 00:00:00;00
 │   │   │   '0'
 │   │   ┕
 │   │   ChapterMarker
 │   │   │   Sequence
 │   │   │        = 2
 │   │   │   Timecode
 │   │   │        = 00:02:54;18
 │   │   │   '1747200000'
 │   │   ┕
 │   ┕
 ┕

Here's a small proc to convert a JSON token stream to XML. It's a work in progress: it doesn't handle the square brackets in JSON. For now it ignores them, so an array with more than one item won't be converted correctly.

Of course there's a Tcl extension for this, written in C. This is just to show how tokenizing with flask can save a lot of work for these sorts of jobs when you cannot use anything beyond pure Tcl, not even extensions (say, if you've got a starpack with few extensions built in).

proc j2x {toks data} {
    set stack [list]     ;# stack the tags
    set lrstack [list 0] ;# stack the left/right flags (so String : String pairs can be tracked l/r, since we only tokenize, we don't parse)
    set tdata {}
    set indent {}
    set output {<?xml version="1.0" encoding= "UTF-8" standalone= "yes" ?>}
    append output "\n"
    foreach tok $toks {
        lassign $tok type start end
        set tdata [string range $data $start $end]
        set indent [string repeat "   " [expr {    [llength $stack]  }] ] 
        if       { $type eq "String"} {
            if { [lindex $lrstack end] } { ;# which String is this left or right of a colon?
                append output  "[string range $tdata 1 end-1]</$sdata>" "\n"
            } else {
                set sdata [string range $tdata 1 end-1] ;# strip the quotes
                append output "${indent}<$sdata>"
            }
        } elseif { $type eq "Colon" } {
            set lr [expr {   1 - [lindex $lrstack end]   }] ;# toggle left/right indicator
            set lrstack [lreplace $lrstack end end $lr]     ;# but need to stack them too
        } elseif { $type eq "Rcurly" } {
            set indent [string repeat "   " [expr {    [llength $stack] - 1 }] ] 
            append output "${indent}</[lindex $stack end]>" "\n"
            set stack [lrange $stack 0 end-1]
            set lrstack [lrange $lrstack 0 end-1]
        } elseif { $type eq "Lcurly" } {
            lappend stack $sdata ;# put our last String on the stack
            lappend lrstack 0    ;# and stack our left/right flags to a 0 (left)
        } elseif { $type eq "Comma" } {
            set lrstack [lrange $lrstack 0 end-1]
            lappend lrstack 0
        } elseif { $type eq "Number" } {
            append output  "${tdata}</$sdata>\n"
        } elseif { $type eq "Lsquare" } {
#           puts "<-- lsquare $stack --> " ;# tbd - not sure how to translate json arrays
        } elseif { $type eq "Rsquare" } {
#           puts "<-- rsquare $stack -->"  ;# tbd
        }   
    }
    set output [regsub -all {>([ \t]+)<} $output ">\n\\1<"] ;# fix those pesky missing newlines
    return $output
}

And here is the data it was given:


# --------------------------

set data {
"class_list":{"student":{
        "name":"Tammy",
        "age": 19,
        'grade':"B",
        "phoneNumbers": [
        { "type": "home", "number": "7349282382" }
        ]
    }}
}

# --------------------------
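
The JSON rules themselves live in the previous section; for readers landing here, this sketch reconstructs what they might look like from the token types in the stream below (an assumption, not the authoritative set):

set rules {
    WS        {\s+}               skip    "whitespace"
    String    {"[^"]*"|'[^']*'}   token   "double or single quoted; the stream below shows 'grade' is accepted"
    Number    {\d+}               token   "integers"
    Colon     {:}                 token   ""
    Comma     {,}                 token   ""
    Lcurly    {\{}                token   ""
    Rcurly    {\}}                token   ""
    Lsquare   {\[}                token   ""
    Rsquare   {\]}                token   ""
    error     {.{1,50}}           {skip {update;error "error at ${$}"}}  "catch anything unexpected"
}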

Here's the token stream that was created for this JSON using those rules, generated by:

    set result [flask $rules $data]


     String 1 12        ->   |"class_list"|
     Colon 13 13        ->   |:|
     Lcurly 14 14       ->   |{|
     String 15 23       ->   |"student"|
     Colon 24 24        ->   |:|
     Lcurly 25 25       ->   |{|
     String 35 40       ->   |"name"|
     Colon 41 41        ->   |:|
     String 42 48       ->   |"Tammy"|
     Comma 49 49        ->   |,|
     String 59 63       ->   |"age"|
     Colon 64 64        ->   |:|
     Number 66 67       ->   |19|
     Comma 68 68        ->   |,|
     String 78 84       ->   |'grade'|
     Colon 85 85        ->   |:|
     String 86 88       ->   |"B"|
     Comma 89 89        ->   |,|
     String 99 112      ->   |"phoneNumbers"|
     Colon 113 113      ->   |:|
     Lsquare 115 115    ->   |[|
     Lcurly 125 125     ->   |{|
     String 127 132     ->   |"type"|
     Colon 133 133      ->   |:|
     String 135 140     ->   |"home"|
     Comma 141 141      ->   |,|
     String 143 150     ->   |"number"|
     Colon 151 151      ->   |:|
     String 153 164     ->   |"7349282382"|
     Rcurly 166 166     ->   |}|
     Rsquare 176 176    ->   |]|
     Rcurly 182 182     ->   |}|
     Rcurly 183 183     ->   |}|

# --------------------------

And this was used 
 
    set xml [j2x [lindex $result 0] $data] ;# our token list is in the first section of the result
    puts $xml

to generate this:

<?xml version="1.0" encoding= "UTF-8" standalone= "yes" ?>
<class_list>
   <student>
      <name>Tammy</name>
      <age>19</age>
      <grade>B</grade>
      <phoneNumbers>
         <type>home</type>
         <number>7349282382</number>
      </phoneNumbers>
   </student>
</class_list>

And that's about it. I hope this helps someone with their parsing and scanning chores.

Please place any user comments here.