flask, a mini-flex/lex proc


ET 2021-06-07 - The following little flex/lex proc was derived from a post on comp.lang.tcl by Christian Gollwitzer.

Intro to Flask

Flask is flex in a bottle. It is a self-contained proc that can do simple regular-expression parsing in a style similar to how the tools flex and lex are used. However, it does not read or write files; instead it is driven entirely by its input parameters and parses the data against the rules on every call.

The input is a list of rules and a text string to parse. The return value is a list of lists of tokens, where a token is a 3-element list holding a token type and 2 string indices into the input data, so that [string range $data $start $end] retrieves the text for a token.
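Here is a minimal sketch of a call (assuming the flask proc from the Code section below has already been sourced; the rule names WS and WORD are just illustrative):

set rules {
    WS      {\s+}       skip     "skip whitespace"
    WORD    {[a-z]+}    token    "a lowercase word"
}
set data "hello world"
set result [flask $rules $data]                  ;# one statement containing two WORD tokens
lassign [lindex $result 0 0] id start end        ;# first token of the first statement
puts "$id -> [string range $data $start $end]"   ;# prints: WORD -> hello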

It can be useful for a quick parse of simple files or the "little languages" that Brian Kernighan used to write about, and one can even work with a couple of different grammars at the same time.

The code section below is set up so you can copy/paste it into a Windows console or Linux terminal window (after starting tclsh) for testing. It includes flask at the top and some test code at the bottom, including test data from a CAD STEP file.

On Windows, or on Linux using rlwrap, the last 2 commands can be recalled with the up arrow. If you add a 4th parameter of true/yes to the flask call, it will re-parse with debug on so you can see what that looks like.
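For example, reusing the variables from the test code at the bottom, a debug re-parse would look like this:

set result [flask $regextokens $data yes yes]   ;# flush extras, debug output on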

Also included is a simple debugging proc, displaytokens, for printing the results.

 User Guide

Flask

   Flask is a mini-lex in a bottle. I.e. it's a self-contained tcl proc
   with no global or namespace data (feel free to rename it, including
   giving it a namespaced name) and needs nothing beyond pure tcl, with
   no additional packages required. It's based on the lex/flex tools
   and derived from a post on comp.lang.tcl by Christian Gollwitzer.

Calling flask

   Flask takes 2 required arguments and 3 optional:
   
   flask     regextokens data {flush yes} {debug no} {indent 3}
   
1. regextokens 
   
   This is a flat list whose elements form an N x 4 matrix: N rows of
   rules and 4 columns (no limit on rows). All 4 columns must be present.
   
   The cells of the matrix are described below along with an example.
   
2. data
   
   This is a text string that represents the data to be parsed. If it came
   from a file, then it is simply all the text in the file, as though read
   in with a single [read $iochannel] statement.
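   For example (a minimal sketch; the filename here is just a placeholder):

set io   [open sample.step r]
set data [read $io]
close $io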
   
3. flush 
   
   This is an optional argument, with a default value of true. If any tokens
   remain after the last eos token (the one that terminates the scan), they
   will be flushed into the result as a final statement when this is true
   (the default).

4. debug

   If this is true, then for each match a line of info is output with [puts]
   showing the token type, regex, position in the input data, and roughly the
   next 40 chars of data. To see the exact match, use a puts in the action
   callback with ${$}, which holds the matched text.

5. indent

   How much to indent debug lines. Useful when calling flask recursively from
   a callback, to give the inner call extra indentation when both the main
   and the callback calls have debug turned on. This should probably grow into
   a list of debug parameters (e.g. max widths of the various fields), but the
   user can always just edit the code.

Output

   flask returns a list of statements. Each statement is a list of
   tokens, and each token is a list of 3 elements: an ID and 2 indices. So
   it's a 3-level structure. For example:

Returned Tokens - structure



  { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} } 

    -token    ---token--         --token--   --token------
  --------statement---------      --------statement---------
 ----------------------------return-----------------------------
   
   The Token ID is a string describing a token type; the 2 indices: start and end
   are used to extract the token from $data, using [string range $data $start $end]
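   Here is a minimal sketch of walking that structure; this is essentially
   what the displaytokens helper in the Code section does:

foreach statement $result {
    foreach token $statement {
        lassign $token id start end
        puts "$id -> [string range $data $start $end]"
    }
}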
    

   The regextokens matrix has 4 columns.

       1          2                 3                     4
    tokenID     Regex             action              a comment
   
   Column 1 is a label to indicate which of the regular expressions in
   column 2 matched during the scan of the data. Column 3 is an action
   to take when a match occurs, and column 4 is a comment. 
   
   The comment is required, but can be just an empty string. It's part of
   the matrix (really a list) but is not used for anything. However, be
   aware that the usual Tcl rules for balancing braces etc. must still be followed.

Flask processing algorithm

   
   The regular expressions will be tried one at a time from top to bottom. If
   a match occurs, the action is taken and the remaining rules are not attempted.
   If it runs out of REs before it finds a match, it will quit. Typically one
   would write a catchall rule at the end to deal with that, perhaps with a callback.
   Note: all regular expressions have \A prepended, so it's not needed in the rules.
   
   The actions can be any one of these words:
   
   skip        - will match the regexp and move the scanning pointer past the token
   token       - will match the regexp and create a token to be found in the result
   eos         - end of statement; finishes the current statement, but the token is NOT output
   eos+token   - the same as eos, except the token WILL be output
   new         - begins a new statement (or section); the matched token is NOT output
   new+token   - begins a new statement THEN outputs the token into it

   When using the new actions there will be an empty list element at the front of
   the result, which can be ignored using [lrange $result 1 end].

   Any other text in the action field is treated the same as a skip, which makes
   it easy to disable (comment out) an action.
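   For instance (a sketch), misspelling an action name effectively disables that
   rule's token output while keeping the rule in place:

set regextokens {
    WS      {[\s]+}     skip      "skip whitespace"
    WORD    {[a-z]+}    xtoken    "not a known action, so this rule now behaves like skip"
    NUM     {[0-9]+}    token     "numbers still produce tokens"
}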
   
   The action can also be a two-element list. The first element must be one of the
   above actions, and the second is a callback script. Whatever text is matched can
   be accessed in the callback using ${$}. Here is an example:

Example grammar Matrix



    tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "Token Id and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of statement"
}

Sample call

set result [flask $regextokens $data] ;# parse and return tokens into result

displaytokens $result $data           ;# debugging print out and example for traversing the result

 Code
# used to debug the output
proc displaytokens {tokens data} {
    set l 0
    foreach line $tokens {
        puts stderr "[incr l] $line"
        foreach token $line {
            lassign $token id from to
            puts [format "    %-17s  ->   |%s|"  $token [string range $data $from $to]]
#            if { [incr count] > 100 } { ;# use this to limit output for a large file
#               return
#            }
#            update
        }   
    }
}

# description follows, along with an example grammar spec


proc flask {regextokens data {flush yes} {debug no} {indent 3}} { ;# indent is for debug in case of recursive calls
    
    # rpos is the running position where to read the next token
    set rpos 0
    
    set result {}
    set resultline {}
    set eos 0
    set newtokens [list]
    # copy the input rules and add a \A to each r.e.
    foreach {key RE actionlist comment} $regextokens {
        lappend newtokens $key "\\A$RE"  $actionlist $comment
    }
    while true {
        set found false
        
        foreach {key RE actionlist comment} $newtokens {
            if {[regexp -indices -start $rpos  $RE $data match]} {
                lassign $match start end
                if { $debug } { ;# map newlines to a unicode char, use stderr to colorize the matched portion (windows only)
                    set v1 [string range $data $rpos [expr {   $rpos+$end-$start     }]] 
                    set v2 [string range $data       [expr {   $rpos+$end-$start+1   }]   $rpos+40] 
                    regsub -all {\n} $v1 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v1 ;# or 21B2
                    regsub -all {\n} $v2 [apply {code {eval set str "\\u[string map "U+ {}" $code]"}} 2936] v2 ;# or 21B2
                    puts -nonewline  [format {%s%-10s %-40s (%4d %4d) |} [string repeat " " $indent] $key $RE $rpos $end]
                    if { $::tcl_platform(platform) eq "windows" } {
                        puts -nonewline stderr $v1
                    } else {
                        puts -nonewline $v1
                    }
                    puts "$v2|"
#                   update
                }
                set action [lindex $actionlist 0] ;# if a list, action first, then callback
                if { $action eq "token" } {
                    lappend resultline [list $key {*}$match]
                } elseif {$action eq "eos+token"} {
                    lappend resultline [list $key {*}$match]
                    set eos 1
                } elseif { $action eq "eos" } {
                    set eos 1
                } elseif { $action eq "new+token" } {
                    lappend result $resultline
                    set resultline [list]
                    lappend resultline [list $key {*}$match]
                } elseif { $action eq "new" } {
                    lappend result $resultline
                    set resultline [list]
                }

                if { [llength $actionlist] > 1 } {
                    set callback [lindex $actionlist 1]
                    set $ [string range $data $start $end]
                    eval $callback
                }
                set rpos [expr {$end+1}] ;# shift
                set found true
                break
            }
        }
        
        if {$found} {
            # minimal bottom up parsing
            # for Token designated as eos end line/statement
            if {$eos} {
                lappend result $resultline
                set resultline {}
                set eos 0
#               puts "end of statement"
            }
            
        } else {
            # nothing matched any longer
            if { $resultline ne {} && $flush} {
                lappend result $resultline
            }
#           puts "Parsing stopped"
            break
        }
    }
    
    return $result
}


#   Flask is a mini-lex in a bottle. I.e. it's a self-contained tcl proc
#   with no global or namespace data (feel free to rename it, including
#   giving it a namespaced name) and needs nothing beyond pure tcl, with
#   no additional packages required. It's based on the lex/flex tools
#   and derived from a post on comp.lang.tcl by Christian Gollwitzer.
#
#   It takes 2 required arguments and 3 optional:
#   
#       regextokens data {flush yes} {debug no} {indent 3}
#   
#   1. regextokens 
#   
#   This is a flat list whose elements form an N x 4 matrix: N rows of
#   rules and 4 columns (no limit on rows). All 4 columns must be present.
#   
#   The cells of the matrix are described below along with an example.
#   
#   2. data
#   
#   This is a text string that represents the data to be parsed. If it came
#   from a file, then it is simply all the text in the file, as though read
#   in with a single [read $iochannel] statement.
#   
#   3. flush 
#   
#   This is an optional argument, with a default value of true. If any tokens
#   remain after the last eos token (the one that terminates the scan), they will
#   be flushed into the result as a final statement when this is true (the default).
#
#   4. debug
#
#   If this is true, then for each match a line of info is output with [puts]
#   showing the token type, regex, position in the input data, and roughly the
#   next 40 chars of data. To see the exact match, use a puts in the action
#   callback with ${$}, which holds the matched text.
#
#   5. indent
#
#   How much to indent debug lines. Useful when calling flask recursively from
#   a callback, to give the inner call extra indentation when both the main
#   and the callback calls have debug turned on. This should probably grow into
#   a list of debug parameters (e.g. max widths of the various fields), but the
#   user can always just edit the code.
#
#   flask returns a list of statements. Each statement is a list of 
#   tokens. And each token is a list of 3 elements, an ID and 2 indices. So
#   it's a 3 level structure. For example:
#
#  { {ID 0 3} {String 4 10} }   { {ID 10 13} {String 14 20} } 
#
#    -token    ---token--         --token--   --token------
#  --------statement---------      --------statement---------
# ----------------------------return-----------------------------
#   
#   The Token ID is a string describing a token type; the 2 indices: start and end
#   are used to extract the token from $data, using [string range $data $start $end]
#    
#
#   The regextokens matrix has 4 columns as shown below.
#   
#   Column 1 is a label to indicate which of the regular expressions in
#   column 2 matched during the scan of the data. Column 3 is an action
#   to take when a match occurs, and column 4 is a comment. 
#   
#   The comment is required, but can be just an empty string. It should follow the rules
#   in tcl for text inside double quotes if used, so for example, braces need to
#   be balanced, and should not include any substitutions or command invocations
#   or they will be processed. Best to avoid these.
#   
#   The regular expressions will be tried one at a time from top to bottom. If
#   a match occurs, the action is taken and the remaining rules are not attempted.
#   If it runs out of REs before it finds a match, it will quit. Typically one
#   would write a catchall rule at the end to deal with that, perhaps with a callback.
#   Note: all regular expressions have \A prepended, so it's not needed in the rules.
#   
#   The actions can be any one of these words:
#   
#   skip        - will match the regexp and move the scanning pointer past the token
#   token       - will match the regexp and create a token to be found in the result
#   eos         - this is the end of statement token, but is NOT output to the token list
#   eos+token   - this is the same as eos, except a token WILL be output
#   new         - this is a start statement that begins a new statement (or section)
#   new+token   - this is a start statement that begins a new statement THEN outputs the token
#
#   When using the new actions there will be an empty list element at the front which
#   can be ignored using a lrange $result 1 end
#
#   Any other text in the action field will be the same as a skip, which can facilitate
#   commenting out the action
#   
#   The action can also be a pair of items in a string list. The first must be one of the
#   above actions, and the second is a callback item. Whatever text is matched can be
#   accessed using ${$}. Here is an example: 

# The columns are

#   tokenID     Regex             action              a comment

set regextokens {
    WS+C        {[\s,]+}          skip                "skip over whitespace and commas"
    ID          {#\d+\s*=\s*}     {token {puts "the ID was '${$}'"}}    "Token Id and a callback"
    String      {'(''|[^'])*'}    token               "a string in single quotes"
    Zeugs       {[^;()', ]+}      token               "stuff"
    LP          {\(}              token               "left paren"
    RP          {\)}              token               "Right paren"
    SEMI        {;}               eos+token           "final result at end of statement"
}

# sample data to parse, from a STEP file

set data {
ISO-10303-21;
HEADER;
FILE_DESCRIPTION (( 'STEP AP214' ),
    '1' );
FILE_NAME ('Airp'' lane_ {V1};$ {} [] ; \ ,"-".STEP',
    '2019-11-26T14:28:03',
    ( '' ),
    ( '' ),
    'SwSTEP 2.0',
    'SolidWorks 2010',
    '' );
FILE_SCHEMA (( 'AUTOMOTIVE_''DESIGN' ));
ENDSEC;

DATA;
#1 = CARTESIAN_POINT ( 'NONE',  ( -3397.537578589738600, -40.70728434983968900, -279.1044191236024400 ) ) ;
#2 = CARTESIAN_POINT ( 'NONE',  ( 3983.737298227797500, 1647.263135894628500, 772.3224850880964100 ) ) ;
#3 = CARTESIAN_POINT ( 'NONE',  ( -457.2417019049098600, 5240.945876103178300, 87.77828949283561100 ) ) ;
#4 = CARTESIAN_POINT ( 'NONE',  ( -1338.327255407125900, -7674.784143274568100, 415.3493082692564800 ) ) ;
ENDSEC;
END-ISO-10303-21; extra junk at end

}

#set data {foobar;baz;} ;# test a minimal do nothing
#set data {foobar}
#set data {}

# ---------- load data file -------------
#set io [open d:/smaller.step r]
#set data [read $io]
#close $io

# ---------- doit data file -------------

set result [flask $regextokens $data yes]
displaytokens $result $data

Flask always generates a list of lists of tokens. This means it has the ability to "parse" as well as tokenize. However, this parsing is limited to 2 levels. This can be used for simple sectioning, useful for, say, statements or lines in a file.

There are 2 ways to start a new section: using a section terminator or a section starter. The difference is whether the token ends up at the end of the current token list or at the beginning of the next one.

If no new sections are created, then all tokens output will be in the first and only section. In that case a [lindex $result 0] is needed to get that list. This would be used when the data being parsed is more complex than just sections, so the result is a single list of tokens, and the program would then need to do its own more detailed parse of the token stream.

Each matched token can also have an action callback. The match can be accessed with ${$} as will be shown below.

There is also a debug option. It outputs with puts a line of information on each match, using a Unicode character in place of newlines to avoid messing up the display. This output shows the token type, regex, position, and roughly the next 40 characters of data. To see exactly what was matched, one can also use a puts in the action callback. The examples below will demonstrate this.

Examples

The following trivial data file, a take-off on the Windows INI configuration format, will be used with 3 styles of parsing. The data to parse is below and is stored in the variable 'data', which is used in the examples that follow. Ordinarily, it would likely be read in from a file with appropriate code.

set data {

[Auto]
# comments here
Updates=1
News=1
;
[Request]
Updates=1
News=1
;
[epp]
Auto=1398
Version=812
File=\\\\tcl8.6.8\\generic\\tclStrToD.c
;

}

This file has sections starting at the square brackets, followed by individual items that are name=value pairs. Sections end with a ; and there can be comments, introduced with a # at the beginning of a line.

Style 1.

This generates a single linear list of tokens.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ID was '${$}'"}}      "Token Id and a callback"
    SEMI        {;}                     skip                                    ""
} ;# this produces a single list of tokens

Comments, whitespace, and the semicolon are parsed, but simply skipped. Running flask as shown below will invoke the puts callbacks to output the matched text. The output using the displaytokens procedure follows.

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug

comment is  '# comments here'
the ID was 'Updates=1'
the ID was 'News=1'
the ID was 'Updates=1'
the ID was 'News=1'
the ID was 'Auto=1398'
the ID was 'Version=812'
the ID was 'File=\\\\tcl8.6.8\\generic\\tclStrToD.c'

%displaytokens $result $data

1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40} {SECT 44 52} {ITEM 54 62} {ITEM 64 69} {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

The 1 above is the section number; that single section contains 10 tokens.

% llength $result
1
% llength [lindex $result 0]
10

Style 2 using an end of section

The next method takes advantage of the semicolons in the data as an end of section indicator. The only difference from the previous one is that the SEMI rule action is eos. This causes it to finish off the current sub-list on each semicolon and begin a new empty one.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}      " "
    WS          {[\s]+}                 skip                                    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         token                                   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token {puts "the ID was '${$}'"}}      "Token Id and a callback"
    SEMI        {;}                     eos                                     "semi colon for end"
} ;# this produces 2 levels, sections and lines by using a end of statement token, a SEMI

And the output:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data


1 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
2 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
3 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|


% llength $result
3
% llength [lindex $result 0]
3
% llength [lindex $result 1]
3
% llength [lindex $result 2]
4

Style 3 using new sections

This method would be used if there were no semicolons to indicate the end of a section; instead, the next section's first token triggers the start of a new section. This is accomplished with the new or new+token actions. When this method is used, there will always be one null section at the very beginning of the output, which can be removed with a [lrange $result 1 end] or just ignored.

set regextokens {
    COMMENT     {#[^\n]*}               {skip {puts "comment is  '${$}'"}}          " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {puts "Section was   '${$}'"}}   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {token     {puts "     ID was '${$}'"}}     "Token Id and a callback"
    SEMI        {;}                     skip                                        "semi colon for end"
} ;# this also produces 2 levels, but starts a new section on the SECT token, but has an extra null section in the beginning

And the results:

%set result [flask $regextokens $data yes no] ;#flush any extra, no debug
... same as above ...
%displaytokens $result $data

1 
2 {SECT 2 7} {ITEM 25 33} {ITEM 35 40}
      SECT 2 7           ->   |[Auto]|
      ITEM 25 33         ->   |Updates=1|
      ITEM 35 40         ->   |News=1|
3 {SECT 44 52} {ITEM 54 62} {ITEM 64 69}
      SECT 44 52         ->   |[Request]|
      ITEM 54 62         ->   |Updates=1|
      ITEM 64 69         ->   |News=1|
4 {SECT 73 77} {ITEM 79 87} {ITEM 89 99} {ITEM 101 139}
      SECT 73 77         ->   |[epp]|
      ITEM 79 87         ->   |Auto=1398|
      ITEM 89 99         ->   |Version=812|
      ITEM 101 139       ->   |File=\\\\tcl8.6.8\\generic\\tclStrToD.c|

% llength $result
4
% llength [lindex $result 0]
0
% llength [lindex $result 1]
3
% llength [lindex $result 2]
3
% llength [lindex $result 3]
4

If the section-starting tokens aren't needed in the output, say because they are simply constant text like "begin" or "data", one can use the action new instead of new+token so that they are not included at the beginning of the sections.
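A sketch of that variation on the grammar above; the section headers still delimit the sections, but no SECT tokens appear in the result:

set regextokens {
    COMMENT     {#[^\n]*}               skip    " "
    WS          {[\s]+}                 skip    "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         new     "start a new section, drop the header token"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   token   "name=value item"
    SEMI        {;}                     skip    "semi colon ignored"
}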

Proof of concept examples

Mostly for fun, but perhaps useful for some real parsing, the callback proc below (2 versions) uses uplevel to access the variables in flask that are building up the token lists, and then inserts tokens into the list being built. It further parses the var=value items in the configuration data file.

Of the two methods below, the first is more robust; the second still has a few glitches, but only on badly formatted data. It is surprisingly difficult to parse a=b with just regexes when there can be spaces around the = and/or multiple = signs. The intent is to take everything after the first = as the value, and not really to allow spaces between the parm and the =, though a true parser should be able to handle that as well. However, the way uplevel is used is correct in both.

proc do_item {item} { ;# proof of concept 1: to further parse the p=v item and insert tokens uplevel
    set eq [string first {=} $item] ;# find first equal sign
    if { $eq >= 0 } { ;# negative if not found
        set parm [string range $item 0 $eq-1]    ;# extract left side of =
        set value [string range $item $eq+1 end] ;# extract right side of =
    } else {
        error "bad item=value no = found $item" ;# should not happen, upper rules catch this
    }
    set parm [string trimright $parm]  ;# now trim
    set lenparm [string length $parm]  ;# len after trimming
    
    set start [uplevel {set start}] ;# get start and end of current token in the data
    set end   [uplevel {set end}]
#   set data  [uplevel {set data}] ;# get the entire input data (used for debug, not needed otherwise)
    
    # compute the abs indices of the parm=value
    set s1 $start
    set e1 [expr {   $start + $lenparm -1  }]
    
    set s2 [expr {   $start + $eq + 1   }]
    set e2 [expr {   $start + $eq + [string length $value]    }]
        
#   uplevel lappend resultline [list [list line $start $end]]   ;# insert a token for the whole line here if want to, we dont
    uplevel lappend resultline [list [list parm $s1 $e1]]       ;# insert a token for the parm
    uplevel lappend resultline [list [list value $s2 $e2]]      ;# insert a token for the value
    
}

The second version calls flask recursively from the callback to parse var=value - just to see if recursion works here.

proc do_item {item} { ;# proof of concept 2: parse further with recursion from a callback and insert same as above
#   take care with these rules, don't use one that is just rule* since this will also match a null
#   and so won't advance the scan pointer resulting in an infinite loop dying at memory exhaustion
#   I found that out, when I had one of the + as a *
#   
#   the ws is only allowed after the parm1 is found and before the =
#

    set rule {
        eq          {=}                    skip    ""
        item1       {[a-zA-Z0-9]+(?= *)}   token   "pos lookahead of optional spaces"
        ws          {\s+(?= *=)}           skip    "ws only allowed before the ="
        item2       {[^\n]+}               token   "if item1 dont match this should get it"
    }
    set result [flask $rule $item yes yes 8]
    if { [llength [lindex $result 0]] < 2} { ;# no value token since it can't match a null string, so only the parm token
        set result [list [list [lindex $result 0 0]  {value 0 -1} ]];# can't handle a rule that matches a null string, so we fake it
    }
    set start [uplevel {set start}] ;# get start of current token in the data
    
    set p1 [expr {   [lindex $result 0 0 1] + $start   }] ;# convert relative indices to absolutes in the real data
    set p2 [expr {   [lindex $result 0 0 2] + $start   }]
    set v1 [expr {   [lindex $result 0 1 1] + $start   }]
    set v2 [expr {   [lindex $result 0 1 2] + $start   }]
    
    uplevel lappend resultline [list [list parm2  $p1 $p2]]      ;# insert a token for the parm
    uplevel lappend resultline [list [list value2 $v1 $v2]]      ;# insert a token for the value
}

These are the grammar rules for the config file. When a var=value item matches, a callback further parses that text and injects tokens into the token stream. Probably not something for the timid, or for production.

This example also demonstrates a method of handling syntax errors. The algorithm of flask is to try each regex from top to bottom, stopping on the first rule that matches. If it finds no match at all, it returns from flask with the current token lists and does not process any further; no error is reported. To handle that case one can provide a catchall rule that matches anything and throws an error in the callback. The ERROR rule below matches any characters and reports up to 10 chars of context. I'm not sure if that should be {0,10} or not; I didn't try it.

To test that, modify the input data to have a space before an =, which the ITEM rule below does not allow. It could also be made legal by adding a space to the ITEM character class.

# test commenting out the callbacks also
set regextokens {
    COMMENT     {#[^\n]*}               {skip {#puts "comment is  '${$}'"}}         " "
    WS          {[\s]+}                 skip                                        "skip over whitespace"
    SECT        {\[[a-zA-Z]+\]}         {new+token {#puts "Section was  '${$}'"}}   "a section header"
    ITEM        {[a-zA-Z0-9]+=[^\n]*}   {skip      {do_item ${$}}}                  "var=value to parse in callback"
    SEMI        {;}                     skip                                        "semi colon for end"
    ERROR       {.{1,10}}               {skip       {error "error at '${$}'"}}      "catch all is an error if we get here"
} ;# this also produces 2 levels, but starts a new section on the SECT token, but has an extra null section in the beginning


set result [flask $regextokens $data yes no]
if {[catch {
    displaytokens $result $data
} err_code]} {
    puts $err_code
}

Here's what that looks like. (Note: I modified the input after I noticed two sections had identical values, so this doesn't exactly match the above sample data; two values were changed from 1 to 2 in the first section, and a space was added after the = in the filename to test that.) Either of the above callbacks produces the same result, just computed differently.

1 
2 {SECT 2 7} {parm 25 31} {value 33 33} {parm 35 38} {value 40 40}
          SECT 2 7                  ->          |[Auto]|
          parm 25 31                ->          |Updates|
          value 33 33               ->          |2|
          parm 35 38                ->          |News|
          value 40 40               ->          |2|
3 {SECT 44 52} {parm 55 61} {value 63 63} {parm 66 69} {value 71 71}
          SECT 44 52                ->          |[Request]|
          parm 55 61                ->          |Updates|
          value 63 63               ->          |1|
          parm 66 69                ->          |News|
          value 71 71               ->          |1|
4 {SECT 75 79} {parm 81 84} {value 86 89} {parm 91 97} {value 99 101} {parm 103 106} {value 108 142}
          SECT 75 79                ->          |[epp]|
          parm 81 84                ->          |Auto|
          value 86 89               ->          |1398|
          parm 91 97                ->          |Version|
          value 99 101              ->          |812|
          parm 103 106              ->          |File|
          value 108 142             ->          | \\\\tcl8.6.8\\generic\\tclStrToD.c|