Version 43 of flask, a mini-flex/lex proc

Updated 2021-06-12 01:29:40 by et4

ET 2021-06-07 - The following little flex/lex proc was derived from a post on comp.lang.tcl by Christian Gollwitzer.

Intro to Flask

Flask is flex in a bottle. That is, it is a self-contained proc that can do some simple regular expression parsing in the style of the tools flex and lex. However, it does not read and write files; it is driven entirely by its input parameters and parses the data using the rules on every call.

The input is a list of rules and a text string to parse. The return value is a list of lists of tokens, where a token is a 3-element list holding a token type and 2 text indices that point into the input data, such that a [string range $data $start $end] retrieves the text for a token.
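
As a minimal sketch (the sample values here are made up for illustration), extracting the text of the first token of the first statement looks like this:

set data   {#1 = 'hello'}
set result { { {ID 0 3} {String 5 11} } }          ;# one statement holding two tokens
lassign [lindex $result 0 0] id start end          ;# first token of the first statement
puts "$id -> |[string range $data $start $end]|"   ;# prints: ID -> |#1 =|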

It can be useful for a quick parse of simple files or of those "little languages" that Brian Kernighan used to write about. One can even parse a couple of different grammars at the same time.

The code section below is set up so you can copy/paste it into a Windows console or Linux terminal window (after running tclsh) for testing. It includes flask at the top and some test code at the bottom, including some test data from a CAD STEP file.

On Windows, or on Linux using rlwrap, the last 2 commands can be recalled with an up arrow. If you add a 4th parameter of true/yes to the flask call, it will re-parse with debug on so you can see what that looks like.
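
For example:

set result [flask $regextokens $data yes yes] ;# 3rd parameter (flush) kept at its default, 4th turns debug on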

Also included is displaytokens, a simple proc for displaying tokens while debugging.

Flask is (as of June 2021) just 87 lines of pure Tcl code. It uses no global variables (with 1 exception, where the debug code tests for Windows vs. others so it can output some text in color), no namespace variables, and no packages. It's a testament to Tcl that so much can be done with so little. By way of comparison, this wiki entry is about 10 times as long. I hope Mr. Wiki, Richard Suchenwirth, would approve. And of course, thanks to Christian, who crafted the original version that I had so much fun working with.

Thanks for stopping by.

 User Guide

Flask

   Flask is a mini-lex in a bottle. I.e. it's a self-contained Tcl proc
   with no global or namespace data (but feel free to rename it, including
   giving it a namespace-qualified name) and needs nothing beyond pure Tcl,
   with no additional packages required. It's based on the lex/flex tools
   and derived from a post on comp.lang.tcl by Christian Gollwitzer.

Calling flask

   Flask takes 2 required arguments and 3 optional ones:
   
   flask     regextokens data {flush yes} {debug no} {indent 3}
   
1. regextokens 
   
   This is a flat list of N x 4 elements, treated as a matrix with
   N rows and 4 columns (no limit on rows). All 4 columns must be present
   in every row.
   
   The cells of the matrix are described below along with an example.
   
2. data
   
   This is a text string holding the data to be parsed. If it came
   from a file, then it is simply all the text in the file, as though read
   in a single [read $iochannel] statement.
   
3. flush 
   
   This is an optional argument with a default value of true. If any tokens
   remain beyond the last eos token (the one that terminates the scan), they
   will be flushed into the result as a final statement when this is true
   (the default).

4. debug

   If this is true, then for each token a line is output with [puts] giving the
   token type, regex, position in the input data, and 20 chars of data. To see
   the exact match, use a puts action with ${$} standing in for the matched text.

5. indent

   How much to indent debug lines. Useful when calling flask from a callback,
   to give extra indentation when both the main call and the callback have debug
   turned on. (A future version should probably gather the debug parameters, such
   as maximum field widths, into a list; for now the user can simply edit the code.)

Output

   flask returns a list of statements. Each statement is a list of
   tokens, and each token is a list of 3 elements: an ID and 2 indices. So
   it's a 3-level structure. For example:

Returned Tokens - structure



  { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} }

    -token-- ----token----       --token--- ----token-----
  --------statement---------   ----------statement----------
 ---------------------------return---------------------------
   
   The token ID is a string describing a token type; the 2 indices, start and end,
   are used to extract the token text from $data using [string range $data $start $end]
    

   The regextokens matrix has 4 columns.

       1          2                 3                     4
    tokenID     Regex             action              a comment
   
   Column 1 is a label to indicate which of the regular expressions in
   column 2 matched during the scan of the data. Column 3 is an action
   to take when a match occurs, and column 4 is a comment. 
   
   The comment is required, but it can be just an empty string. It's part of
   the matrix (really a list) but is not used for anything. However, be
   aware that the usual rules for balancing braces etc. must still be observed.

Flask processing algorithm

   
   The regular expressions are tried one at a time, from top to bottom, starting at
   the first position in the input data text string.

   When flask finds a match, it looks up the action for that RE pattern; after it
   performs the action, and invokes any provided callback, it shifts the input
   pointer past the matched text and starts over at the first rule, looking for
   another match.

   This proceeds until there is no more data in the string OR no match is
   possible. If the last rule is simply a . then it can serve as a catchall rule,
   and any included callbacks will be executed. Often this is a call to an error
   routine.
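
   A minimal sketch of such a catchall row (the message text is just an example;
   a later example on this page uses .{1,10} to show more context):

    ERROR       {.}               {skip {error "no rule matched at '${$}'"}}   "catchall"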

   Note: all regular expressions will have \A prepended, so it's not needed in the rules.
  
   The actions can be any one of these words:
   
   skip        - will match the regexp and move the scanning pointer past the token
   token       - will match the regexp and create a token to be found in the result
   eos         - this is the end of statement token, but is NOT output to the token list
   eos+token   - this is the same as eos, except a token WILL be output
   new         - this is a start statement that begins a new statement (or section)
   new+token   - this is a start statement that begins a new statement THEN outputs the token

   When using the new actions there will be an empty list element at the front
   of the result, which can be ignored using [lrange $result 1 end].
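
   For example, right after the call:

set result [lrange $result 1 end] ;# drop the empty leading statement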

   Any other text in the action field is treated the same as skip, which makes
   it easy to comment out an action.
   
   The action can also be a pair of items in a list. The first must be one of the
   above actions, and the second is a callback script. Whatever text was matched
   can be accessed within it using ${$}. Here is an example action with a callback:

   {token {callback ${$} } }

   This will output a token for this rule and invoke the callback routine, passing
   it the text that was matched.
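
   A minimal sketch of a matching callback proc (the name callback and its body
   are just placeholders for whatever routine you provide):

proc callback {matched} {
    puts "rule matched: |$matched|"   ;# receives the text that ${$} stood for
}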

Example grammar Matrix



    tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "Token Id and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of statement"
}

Sample call

set result [flask $regextokens $data] ;# parse and return tokens into result

displaytokens $result $data           ;# debugging print out and example for traversing the result

 Code
# used to debug the output
proc displaytokens {tokens data} {
    set l 0
    foreach line $tokens {
        puts stderr "[incr l] $line"
        foreach token $line {
            lassign $token id from to
            puts [format "    %-17s  ->   |%s|"  $token [string range $data $from $to]]
#            if { [incr count] > 100 } { ;# use this to limit output for a large file
#               return
#            }
#            update
        }   
    }
}

# description follows, along with an example grammar spec


proc flask {regextokens data {flush yes} {debug no} {indent 3}} { ;# indent is for debug in case of recursive calls
    
    # rpos is the running position where to read the next token
    set rpos 0
    
    set result {}
    set resultline {}
    set eos 0
    set newtokens [list]
    # copy the input rules and add a \A to each r.e.
    foreach {key RE actionlist comment} $regextokens {
        lappend newtokens $key "\\A$RE"  $actionlist $comment
    }
    while true {
        set found false
        
        foreach {key RE actionlist comment} $newtokens {
            if {[regexp -indices -start $rpos  $RE $data match  cap1 cap2 cap3 cap4 cap5 cap6 cap7 cap8 cap9]} {
                lassign $match start end
                if { $debug } { ;# map newlines to a unicode char, use stderr to colorize the matched portion (windows only)
                    set v1 [string range $data $rpos [expr {   $rpos+$end-$start     }]] 
                    set v2 [string range $data       [expr {   $rpos+$end-$start+1   }]   $rpos+40] 
                    regsub -all {\n} $v1 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v1 ;# or 21B2
                    regsub -all {\n} $v2 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v2 ;# or 21B2
                    puts -nonewline  [format {%s%-10s %-40s (%4d %4d) |} [string repeat " " $indent] $key $RE $rpos $end]
                    if { $::tcl_platform(platform) eq "windows" } {
                        puts -nonewline stderr "\U2507$v1\U2507"
                    } else {
                        puts -nonewline "\U2507$v1\U2507"
                    }
                    puts "$v2|"
#                   update
                }
                set action [lindex $actionlist 0] ;# if a list, action first, then callback
                if { $action eq "token" } {
                    lappend resultline [list $key {*}$match]
                } elseif {$action eq "eos+token"} {
                    lappend resultline [list $key {*}$match]
                    set eos 1
                } elseif { $action eq "eos" } {
                    set eos 1
                } elseif { $action eq "new+token" } {
                    lappend result $resultline
                    set resultline [list]
                    lappend resultline [list $key {*}$match]
                } elseif { $action eq "new" } {
                    lappend result $resultline
                    set resultline [list]
                }

                if { [llength $actionlist] > 1 } {
                    set callback [lindex $actionlist 1]
                    set $ [string range $data $start $end] ;# expose the matched text to the callback as ${$}
                    eval $callback
                }
                set rpos [expr {$end+1}] ;# shift
                set found true
                break
            }
        }
        
        if {$found} {
            # minimal bottom up parsing
            # for Token designated as eos end line/statement
            if {$eos} {
                lappend result $resultline
                set resultline {}
                set eos 0
#               puts "end of statement"
            }
            
        } else {
            # nothing matched any longer
            if { $resultline ne {} && $flush} {
                lappend result $resultline
            }
#           puts "Parsing stopped"
            break
        }
    }
    
    return $result
}


#   Flask is a mini-lex in a bottle. I.e. it's a self-contained Tcl proc
#   with no global or namespace data (but feel free to rename it, including
#   giving it a namespace-qualified name) and needs nothing beyond pure Tcl,
#   with no additional packages required. It's based on the lex/flex tools
#   and derived from a post on comp.lang.tcl by Christian Gollwitzer.
#
#   It takes 2 required arguments and 3 optional:
#   
#       regextokens data {flush yes} {debug no} {indent 3}
#   
#   1. regextokens 
#   
#   This is a flat list of N x 4 elements, treated as a matrix with
#   N rows and 4 columns (no limit on rows). All 4 columns must be present
#   in every row.
#   
#   The cells of the matrix are described below along with an example.
#   
#   2. data
#   
#   This is a text string holding the data to be parsed. If it came
#   from a file, then it is simply all the text in the file, as though read
#   in a single [read $iochannel] statement.
#   
#   3. flush 
#   
#   This is an optional argument with a default value of true. If any tokens
#   remain beyond the last eos token (the one that terminates the scan), they
#   will be flushed into the result as a final statement when this is true
#   (the default).
#
#   4. debug
#
#   If this is true, then for each token a line is output with [puts] giving the
#   token type, regex, position in the input data, and 20 chars of data. To see
#   the exact match, use a puts action with ${$} standing in for the matched text.
#
#   5. indent
#
#   How much to indent debug lines. Useful when calling flask from a callback,
#   to give extra indentation when both the main call and the callback have debug
#   turned on. (A future version should probably gather the debug parameters, such
#   as maximum field widths, into a list; for now the user can simply edit the code.)
#
#   flask returns a list of statements. Each statement is a list of
#   tokens, and each token is a list of 3 elements: an ID and 2 indices. So
#   it's a 3-level structure. For example:
#
#  { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} }
#
#    -token-- ----token----       --token--- ----token-----
#  --------statement---------   ----------statement----------
# ---------------------------return---------------------------
#   
#   The token ID is a string describing a token type; the 2 indices, start and end,
#   are used to extract the token text from $data using [string range $data $start $end]
#    
#
#   The regextokens matrix has 4 columns as shown below.
#   
#   Column 1 is a label to indicate which of the regular expressions in
#   column 2 matched during the scan of the data. Column 3 is an action
#   to take when a match occurs, and column 4 is a comment. 
#   
#   The comment is required, but it can be just an empty string. If double quotes
#   are used, the usual Tcl quoting rules apply: braces must be balanced, and any
#   substitutions or command invocations it contains will be processed, so it is
#   best to avoid them.
#   
#   The regular expressions are tried one at a time from top to bottom. If
#   a match occurs, the action is taken and the following rules are not attempted.
#   If flask runs out of REs before it finds a match, it quits. Typically one
#   would write a catchall rule at the end to deal with that, perhaps with a callback.
#   Note: all regular expressions will have \A prepended, so it's not needed in the rules.
#   
#   The actions can be any one of these words:
#   
#   skip        - will match the regexp and move the scanning pointer past the token
#   token       - will match the regexp and create a token to be found in the result
#   eos         - this is the end of statement token, but is NOT output to the token list
#   eos+token   - this is the same as eos, except a token WILL be output
#   new         - this is a start statement that begins a new statement (or section)
#   new+token   - this is a start statement that begins a new statement THEN outputs the token
#
#   When using the new actions there will be an empty list element at the front
#   of the result, which can be ignored using [lrange $result 1 end]
#
#   Any other text in the action field is treated the same as skip, which makes
#   it easy to comment out an action
#   
#   The action can also be a pair of items in a list. The first must be one of the
#   above actions, and the second is a callback script. Whatever text was matched
#   can be accessed within it using ${$}. Here is an example:

# The columns are

#   tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "Token Id and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of statement"
}

# sample data to parse, from a STEP file

set data {
ISO-10303-21;
HEADER;
FILE_DESCRIPTION (( 'STEP AP214' ),
    '1' );
FILE_NAME ('Airp'' lane_ {V1};$ {} [] ; \ ,"-".STEP',
    '2019-11-26T14:28:03',
    ( '' ),
    ( '' ),
    'SwSTEP 2.0',
    'SolidWorks 2010',
    '' );
FILE_SCHEMA (( 'AUTOMOTIVE_''DESIGN' ));
ENDSEC;

DATA;
#1 = CARTESIAN_POINT ( 'NONE',  ( -3397.537578589738600, -40.70728434983968900, -279.1044191236024400 ) ) ;
#2 = CARTESIAN_POINT ( 'NONE',  ( 3983.737298227797500, 1647.263135894628500, 772.3224850880964100 ) ) ;
#3 = CARTESIAN_POINT ( 'NONE',  ( -457.2417019049098600, 5240.945876103178300, 87.77828949283561100 ) ) ;
#4 = CARTESIAN_POINT ( 'NONE',  ( -1338.327255407125900, -7674.784143274568100, 415.3493082692564800 ) ) ;
ENDSEC;
END-ISO-10303-21; extra junk at end

}

#set data {foobar;baz;} ;# test a minimal do nothing
#set data {foobar}
#set data {}

# ---------- load data file -------------
#set io [open d:/smaller.step r]
#set data [read $io]
#close $io

# ---------- doit data file -------------

set result [flask $regextokens $data yes]
displaytokens $result $data

Flask always generates a list of lists of tokens. This means it can "parse" as well as tokenize, though the parsing is limited to 2 levels. It can be used for simple sectioning, useful for, say, statements or lines in a file.

There are 2 ways to start a new section: a section terminator or a section starter. The difference is whether the token lands at the end of one token list or at the beginning of the next.

If no new sections are created, then all the tokens output will be in the first and only section, and a [lindex $result 0] is needed to get at that list. This mode is for data more complex than simple sections, where the result is a single list of tokens; the program then does its own more detailed parse of the token stream.
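
That is:

set tokens [lindex $result 0] ;# the single statement holding all the tokens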

Each matched token can also have an action callback. The match can be accessed with ${$} as will be shown below.

There is also a debug option. For each match it outputs a line of information with puts, using a Unicode character in place of newlines to avoid messing up the display. The line shows the token type, regex, position, and the next 20 characters. To see exactly what was matched, one can also use a puts in the action callback. The examples below will demonstrate this.

Examples

The following trivial data file, a takeoff on the Windows configuration format, will be used with 3 styles of parsing. The data to parse is shown below and is stored in the variable data, which is used in the examples that follow. Ordinarily it would likely be read in from a file with appropriate code.

set data {

[Auto]
# comments here
Updates=1
News=1
;
[Request]
Updates=1
News=1
;
[epp]
Auto=1398
Version=812
File=\\\\tcl8.6.8\\generic\\tclStrToD.c
;

}

This file has sections starting with the square-bracketed names, followed by individual items that are name=value pairs. Sections end with a ; and lines beginning with # are comments.

Style 1.

This generates a single linear list of tokens.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ID was '${$}'"}}      "Token Id and a callback"
    SEMI        {;}                     skip                                    ""
} ;# this produces a single list of tokens

Comments, whitespace, and the semicolons are parsed but simply skipped. Running flask as shown below fires the puts callbacks, which output the matched text. Output from the displaytokens procedure follows.

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug

comment is  '# comments here'
the ID was 'Updates=1'
the ID was 'News=1'
the ID was 'Updates=1'
the ID was 'News=1'
the ID was 'Auto=1398'
the ID was 'Version=812'
the ID was 'File=\\\\tcl8.6.8\\generic\\tclStrToD.c'

%displaytokens $result $data

1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40} {SECT 44 52} {ITEM 54 62} {ITEM 64 69} {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

The 1 above is the section number; this single section contains all 10 tokens.

% llength $result
1
% llength [lindex $result 0]
10

Style 2 using an end of section

The next method takes advantage of the semicolons in the data as end-of-section indicators. The only difference from the previous grammar is that the SEMI rule's action is eos. This causes flask to finish off the current sub-list at each semicolon and begin a new empty one.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ID was '${$}'"}}      "Token Id and a callback"
    SEMI        {;}                     eos                                     "semi colon for end"
} ;# this produces 2 levels, sections and lines, by using an end of statement token, a SEMI

And the output:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data


1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
2 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
3 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|


% llength $result
3
% llength [lindex $result 0]
3
% llength [lindex $result 1]
3
% llength [lindex $result 2]
4

Style 3 using new sections

This method is used when there are no semicolons to indicate the end of a section; instead the next section's first token triggers the start of a new section, which the new or new+token action accomplishes. With this method there will always be one null section at the very beginning of the output. It can be removed with a [lrange $result 1 end] or just ignored.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}          " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {puts "Section was   '${$}'"}}   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token     {puts "     ID was '${$}'"}}     "Token Id and a callback"
    SEMI        {;}                     skip                                        "semi colon for end"
} ;# this also produces 2 levels, starting a new section on each SECT token, with an extra null section at the beginning

And the results:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data

1 
2 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
3 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
4 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

% llength $result
4
% llength [lindex $result 0]
0
% llength [lindex $result 1]
3
% llength [lindex $result 2]
3
% llength [lindex $result 3]
4

If the section-starting tokens aren't needed, say because they are just constant text like "begin" or "data", one can use the action new instead of new+token so they are not included at the beginning of the sections.

Proof of concept examples

Mostly for fun, but perhaps useful for some real parsing, the callback proc below (in 2 versions) uses uplevel to access the variables in flask that build up the token lists, and inserts its own tokens into the list under construction. It parses the var=value items in the configuration data file.

Of the two methods below, the first is more robust; the second still has a few glitches, but only on badly formed data. It's surprisingly difficult to parse a=b with just regexes when there can be spaces around the = and/or multiple = signs. The intent is that everything after the first = is the value, and spaces between the name and the = are not really supposed to be allowed, though a true parser should handle that as well. However, the use of uplevel is correct in both.

proc do_item {item} { ;# proof of concept 1: to further parse the p=v item and insert tokens uplevel
    set eq [string first {=} $item] ;# find first equal sign
    if { $eq >= 0 } { ;# negative if not found
        set parm [string range $item 0 $eq-1]    ;# extract left side of =
        set value [string range $item $eq+1 end] ;# extract right side of =
    } else {
        error "bad item=value no = found $item" ;# should not happen, upper rules catch this
    }
    set parm [string trimright $parm]  ;# now trim
    set lenparm [string length $parm]  ;# len after trimming
    
    set start [uplevel {set start}] ;# get start and end of current token in the data
    set end   [uplevel {set end}]
#   set data  [uplevel {set data}] ;# get the entire input data (used for debug, not needed otherwise)
    
    # compute the abs indices of the parm=value
    set s1 $start
    set e1 [expr {   $start + $lenparm -1  }]
    
    set s2 [expr {   $start + $eq + 1   }]
    set e2 [expr {   $start + $eq + [string length $value]    }]
        
#   uplevel lappend resultline [list [list line $start $end]]   ;# insert a token for the whole line here if want to, we dont
    uplevel lappend resultline [list [list parm $s1 $e1]]       ;# insert a token for the parm
    uplevel lappend resultline [list [list value $s2 $e2]]      ;# insert a token for the value
    
}

The second version calls flask recursively from the callback to parse var=value - just to see if recursion works here.

proc do_item {item} { ;# proof of concept 2: parse further with recursion from a callback and insert same as above
#   take care with these rules, don't use one that is just rule* since this will also match a null
#   and so won't advance the scan pointer resulting in an infinite loop dying at memory exhaustion
#   I found that out when I had one of the + as a *
#   
#   the ws is only allowed after the parm1 is found and before the =
#

    set rule {
        eq          {=}                    skip    ""
        item1       {[a-zA-Z0-9]+(?= *)}   token   "pos lookahead of optional spaces"
        ws          {\s+(?= *=)}           skip    "ws only allowed before the ="
        item2       {[^\n]+}               token   "if item1 doesn't match, this should get it"
    }
    set result [flask $rule $item yes yes 8]
    if { [llength [lindex $result 0]] < 2} { ;# no value token since it can't match a null string, so only the parm token
        set result [list [list [lindex $result 0 0]  {value 0 -1} ]];# can't handle a rule that matches a null string, so we fake it
    }
    set start [uplevel {set start}] ;# get start of current token in the data
    
    set p1 [expr {   [lindex $result 0 0 1] + $start   }] ;# convert relative indices to absolutes in the real data
    set p2 [expr {   [lindex $result 0 0 2] + $start   }]
    set v1 [expr {   [lindex $result 0 1 1] + $start   }]
    set v2 [expr {   [lindex $result 0 1 2] + $start   }]
    
    uplevel lappend resultline [list [list parm2  $p1 $p2]]      ;# insert a token for the parm
    uplevel lappend resultline [list [list value2 $v1 $v2]]      ;# insert a token for the value
}

These are the grammar rules for the config file. When a var=value item matches, a callback further parses that text and injects tokens into the token stream. Probably not something for the timid, or for production.

This example also demonstrates a method for handling syntax errors. The flask algorithm tries each regex from top to bottom, stopping at the first rule that matches. If no rule matches at all, flask returns with the token lists accumulated so far and processes no further; no error is reported. To handle that case, one can provide a catchall rule that matches anything and throws an error in its callback. The one below matches any characters and spits out up to 10 chars of context. (A {0,10} quantifier could match the empty string and never advance the scan pointer, the infinite-loop hazard noted earlier, so {1,10} is the safer choice.)

To test that, modify the input data to have a space before an =, which the ITEM rule below does not allow. (It could also be made legal by adding a space to the character class.)

# test commenting out the callbacks also
set regextokens {
    COMMENT     {#[^\n]*}               {skip {#puts "comment is  '${$}'"}}         " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {#puts "Section was  '${$}'"}}   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {skip      {do_item ${$}}}                  "var=value to parse in callback"
    SEMI        {;}                     skip                                        "semi colon for end"
    ERROR       {.{1,10}}               {skip       {error "error at '${$}'"}}      "catch all is an error if we get here"
} ;# this also produces 2 levels, but starts a new section on the SECT token, but has an extra null section in the beginning


set result [flask $regextokens $data yes no]
if {[catch {
    displaytokens $result $data
} err_code]} {
    puts $err_code
}

This final method seems to be the easiest approach now that capture groups have been added; there are now 9 cap groups available. Note that the regular expression for the ITEM rule has been changed.

proc do_item {item} { ;# proof of concept 3: further parse the p=v using capture groups

    set cap1 [uplevel {set cap1}] ;# capture group 1 is the label
    set cap3 [uplevel {set cap3}] ;# capture group 3 is everything after the first =
#   puts "cap1= |$cap1| cap3= |$cap3| "

    uplevel lappend resultline [list [list Parm  {*}$cap1]]
    uplevel lappend resultline [list [list Value {*}$cap3]]
}

# This method uses capturing groups, and the code uses cap1 and cap3. The cap* variables hold the
# right indices, which makes this much easier. The number of cap groups has now been increased to 9.
# The rule allows optional whitespace after the label, followed by at most one equal sign.

#    ITEM   {([a-zA-Z0-9]+)(\s*=)?([^\n]*)}   {skip   {do_item ${$}}}    "var=value to parse in callback"

Here's what that looks like. (Note: I modified the input after noticing two sections had identical values, so this doesn't exactly match the sample data above; two values were changed from 1 to 2 in the first section, and a space was added after the = in the filename to test that.) Either of the above callbacks produces the same thing, just done differently.

1 
2 {SECT 2 7} {parm 25 31} {value 33 33} {parm 35 38} {value 40 40}
          SECT 2 7                  ->          |[Auto]|
          parm 25 31                ->          |Updates|
          value 33 33               ->          |2|
          parm 35 38                ->          |News|
          value 40 40               ->          |2|
3 {SECT 44 52} {parm 55 61} {value 63 63} {parm 66 69} {value 71 71}
          SECT 44 52                ->          |[Request]|
          parm 55 61                ->          |Updates|
          value 63 63               ->          |1|
          parm 66 69                ->          |News|
          value 71 71               ->          |1|
4 {SECT 75 79} {parm 81 84} {value 86 89} {parm 91 97} {value 99 101} {parm 103 106} {value 108 142}
          SECT 75 79                ->          |[epp]|
          parm 81 84                ->          |Auto|
          value 86 89               ->          |1398|
          parm 91 97                ->          |Version|
          value 99 101              ->          |812|
          parm 103 106              ->          |File|
          value 108 142             ->          | \\\\tcl8.6.8\\generic\\tclStrToD.c|

Advanced topics

Callbacks

Typically a callback will be used to invoke a procedure. When this proc is called, it is running 1
level below flask, so with a very simple use of [uplevel] all the local variables of flask can be
accessed, and even changed, from the callback. Here is a list of the most useful ones. They can be
accessed as follows:

set info [uplevel {info local}] ;# get a list of all the locals
puts "info= |$info| "
foreach v {rpos key RE actionlist comment match start end cap1 cap2 cap3 cap4}  { ;# most useful
        set val [uplevel "set $v"]
        puts [format {        %-10s = %s} $v |$val|  ]
}
set data  [uplevel {set data}] ;# get the entire input data but don't list it, too big

To access individual variables, one can use statements like these (say, to get start and end):

set start  [uplevel {set start}] ;# get start and store in callback local of same name
set end    [uplevel {set end}]   ;# get end   and do the same

Here are the variables and their uses:


data          The string variable, unchanging, that was passed into flask
rpos          This is the running position (as an index) into data

key           These are the 4 columns of the current rule that has matched
RE
actionlist
comment

match         The current match, a pair of indices, a string range on this pair will access the matched data
start         The first index in match
end           The second index in match

cap1          If there were any capturing groups, up to 9 are found in these 9 variables, -1 -1 for extras
cap2
cap3
cap4
...
cap9


resultline    This is the current section's list of tokens (originally a section was a line)
result        This is the list of lists; at each section end, resultline is appended to result and then cleared

When capturing groups are used in the regex (portions of the RE in parentheses), the indices of each
capture are found in cap1..cap9. Groups not present in the RE are assigned the values -1 -1

A callback can add tokens to the list that is being built up. A token is always a 3-element list.
For example, to add the token

mytoken 5 10

to the current section (in resultline) one could do this,

uplevel lappend resultline [list [list mytoken 5 10]]  

Note that to be useful, the indices must refer to positions in the data variable.
If one does some parsing of the matched string, say with a [string first] or [regexp]
statement that returns indices, then one typically needs to add the offset found in
the start variable for the result to make sense.
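
A hedged sketch of that arithmetic (the do_value proc and its rule are hypothetical;
assume its rule matched text like name=value with the action {skip {do_value ${$}}}):

proc do_value {item} {
    set rel   [string first = $item]                    ;# index relative to the matched text
    set start [uplevel {set start}]                     ;# absolute start of the match in data
    set s [expr {$start + $rel + 1}]                    ;# absolute start of the value part
    set e [expr {$start + [string length $item] - 1}]   ;# absolute end of the match
    uplevel lappend resultline [list [list VALUE $s $e]] ;# inject a token with absolute indices
}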

After a match, and AFTER the callback is made, the variable rpos is updated to point to
the next position in the text (in data). This is done as follows:

set rpos [expr {$end+1}] ;# shift

So it is possible, though somewhat risky, to modify rpos from the callback, knowing it will
be updated (immediately) after the callback returns. This is accomplished by modifying
the variable end before returning from the callback.
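
A minimal sketch of that idea (the 3-character give-back is arbitrary, and nothing
here guards against rescanning the same text forever):

# inside a callback: shrink the match so flask resumes 3 characters earlier
set end [uplevel {set end}]               ;# absolute end of the current match
uplevel [list set end [expr {$end - 3}]]  ;# flask sets rpos to end+1 right after this callback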

And of course, all of this assumes that flask will not be modified, so all bets are off if
one does that. On the other hand, it's a small proc and the source code is provided.

 

Debugging

Flask has a tracing feature that is useful for debugging your rules. It's turned on with the 4th parameter to the flask proc. For example,

set result [flask $rules $data yes yes 3]

When calling flask, the final 3 parameters are optional. The 3rd parameter defaults to yes (for flush), so to supply one or both debug parameters (the 4th and 5th) you must also supply a value for the 3rd. The 5th parameter indents the debug output, making it easier to distinguish any other output that might come from callbacks; it defaults to 3.

Here is a sample display of the output:

   WS         \A[\s]+                                  (   0    1) |┊⤶⤶┊[Auto]⤶# comments here that are very lo|
   SECT       \A\[[a-zA-Z]+\]                          (   2    7) |┊[Auto]┊⤶# comments here that are very long|
   WS         \A[\s]+                                  (   8    8) |┊⤶┊# comments here that are very long indee|
   COMMENT    \A#[^\n]*                                (   9   72) |┊# comments here that are very long indeed will go past the limit┊|
comment is  '# comments here that are very long indeed will go past the limit'
   WS         \A[\s]+                                  (  73   73) |┊⤶┊Updates=2⤶News=2⤶;⤶[Request]⤶ Updates=1⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  74   82) |┊Updates=2┊⤶News=2⤶;⤶[Request]⤶ Updates=1⤶ |
   WS         \A[\s]+                                  (  83   83) |┊⤶┊News=2⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶|
   ITEM       \A([a-zA-Z0-9]+)(\s*=)?([^\n]*)          (  84   89) |┊News=2┊⤶;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[|
   WS         \A[\s]+                                  (  90   90) |┊⤶┊;⤶[Request]⤶ Updates=1⤶ News=1⤶;⤶[epp]⤶A|

The first column is the token code, then the regular expression (notice the \A has been added automatically), followed by the start and end indices of the matched data.

Next is the text where the rule started its match, shown between vertical bars; the match itself is additionally enclosed in a pair of Unicode dashed vertical bars. On Windows, those 2 bars plus the matched text are colored red (but not on Linux). All newlines are mapped to a Unicode character for better visibility of the text.

All rules will be output, even if they don't produce tokens. And any callbacks that output text will be interspersed as well.

Of course, since you have the source code, you can just find the [format] statements and change the column sizes if you want other values.