Sequence Notation

Brief Summary

JAL 2021-06-05: This is a small utility proc that can expand patterns of ranges or sets. It can be used to cross-product ranges or sets together using a compact notiation, e.g.

 expand_seq {[0-3] foo-{a,b}} 

expands to:

 0 foo-a
 0 foo-b
 1 foo-a
 1 foo-b
 2 foo-a
 2 foo-b
 3 foo-a
 3 foo-b

I think I have seen other utilities like this on this wiki, but I wasn't able to find them after a search. Maybe I couldn't pin down the right notation? Pointers or comments are very welcome! I'm sure this problem has been solved often. And the title can probably use some work too.

Background

I had a test program where I needed to run a large number of tests. Each of the test parameters in a series can be produced by expanding ranges or sets of values. E.g.

 test -size 1
 test -size 2
 test -size 3
 test -size 10
 test -size 20
 test -size 30
 test -size 100
 ... and so forth ...

The pattern would need to proceed to 3000000. Let's define and solve a simpler problem first. What if I want to specify a simple contiguous range of integers? E.g, 1 2 3 4 5. For this type of simple range, we will use notation [1-5]. OK so far.

Next logical step: what if we wanted to increment by a number other than 1? E.g. odd numbers from 1 up to (and including) 7. Notation: [1-7~2].

Pattern: [1-7~2] expands to:

 1 
 3
 5
 7

But note in my test invocations above, I can't just use a simple range increment. I need to expand logarithmically. We can solve this by allowing multiple ranges in one pattern, plus TCL expr's idea of 'e'. For simplicity, if there are multiple patterns, they will be expanded from left to right. If you like another order, I'd suggest maybe change the code to take the LAST match on each pattern, or just lsort -dictionary the resulting list of expansions.

Pattern: [1-3]e[1-6] expands to:

 1e1 = 10.0
 2e1 = 20.0
 3e1 = 30.0
 1e2 = 100.0
 2e2 = 200.0
 3e2 = 300.0
 ...
 3e7 = 3000000

So far so good again. Now, a new requirement. Some parameters are sets of letters or words, but otherwise should operate similarly to integer ranges as above. E.g. version numbers 1 to 3, with a suffix of a or b. New syntax: {a,b}. Pattern: [1-3]{a,b} expands to:

 1a
 2a
 3a
 1b
 2b
 3b

The order that this pattern produces, due to our policy of left-to-right expansion, may not be desired, eg., 2a follows 1a, where for some purposes it should be 1b that follows 1a. Let's save that problem for another day.

Next, what if I want to have a few different sub-patterns inside one pattern, e.g. for software version 3, there should be suffixes a, b, and c, but for versions 1 and 2, there is only a. Pattern: [1-2]a+3{a,b,c}

 1a
 2a
 3a
 3b
 3c

OK, so now my problem is pretty much solved. For any test parameter where I used to just accept an integer, I now allow an expansion. So I can target a large parameter space with a short specification, e.g.

Pattern: test -version [1-2]a+3{a,b,c} -size {1,2,3,8}e[1-6]

 test -version 1a -size 1e1
 test -version 2a -size 1e1
 test -version 3a -size 1e1
 ...

and so forth. Here is the code.

#
# seq_expand: Expand sequences, from ranges or sets.
# Takes an pattern and returns a list of expansions.
# version 1.2
#
# range expansion
# e.g.
#   [1-5]       = 1 2 3 4 5
#   [1-9~2]     = 1 3 5 7 9
#   [1-3]e[1-4] = 1 2 3 10 20 30 100 200 300
#
# set expansion
# e.g.
# {1,2,3}05{3,2} = 1053 1052 2053 2052
#
# alternate expansion
# e.g.
#   {a,b}+[1-3] = a b 1 2 3
#
# Caveats:
# ranges don't support decimals
#
proc expand_seq {seqexpr} {

    set pat_l [list $seqexpr]
    set exp_l []

    while {[llength $pat_l] > 0} {
        set pat [lindex $pat_l 0]
        set pat_l [lrange $pat_l 1 end]

        # any expansion elements we produce for current pattern
        set cross_l []

        # if any __word__ (contiguous non-WS) has a + in it, split it and prepend 
        # each sub-pattern.
        #
        # good pattern:
        if {[regexp -indices {.*(?:\t| |^)([^ \t]*[^\\ ](\+)[^ ]+).*} $pat _ alt_group plus]} {
            # puts "Found alt-group at $alt_group"
            lassign $alt_group sub_b sub_e
            set alt_group [string range $pat $sub_b $sub_e]
            # puts "Alt group is: '$alt_group'"
            set prev 0
            # split on any + that does not have a backslash before it.
            for {set i 0} {$i < [string length $alt_group]} {incr i} { 
                if {[string index $alt_group $i] == "+" && 
                    [string index $alt_group $i-1] != "\\"} {
                    # puts "Found + at $i"
                    lappend cross_l [string range $alt_group $prev $i-1]
                    set prev [expr $i + 1]
                }
            }
            # add everything past the last + to the end of $alt_group.
            if {$prev != 0} { 
                lappend cross_l [string range $alt_group $prev end]
            }
        }

        if {$cross_l == [] && [regexp {.*(\[([0-9]+)(-)([0-9]+)(~[0-9]+)?\]).*} $pat _ brackets d_b dash d_e d_s]} {
            set any_expansion 1
            set sub_b [string first $brackets $pat]
            if {$sub_b == -1} { error "Can't re-find bracket expr in pattern?" }
            set sub_e [expr $sub_b + [string length $brackets] - 1]

            set d_s [string trimleft $d_s ~]
            if {$d_s == ""} { set d_s 1 }

            foreach int [list $d_b $d_e $d_s] {
                if {![string is integer $int]} {
                    error "In sequence expansion, can't parse '$int' as an integer"
                }
            }
            if {$d_s == 0} { error "In sequence expansion, step size can't be 0" }
            for {set x $d_b} {$x <= $d_e} {incr x $d_s} {
                lappend cross_l $x
            }
        } 

        if {$cross_l == [] && [regexp -indices {.*(?:[^\\]|^)({[^\}]+}).*} $pat _ brackets]} {
            lassign $brackets sub_b sub_e
            set inside [string range $pat $sub_b+1 $sub_e-1]
            foreach elem [split $inside ,] {
                lappend cross_l $elem
            }
        }

        if {$cross_l == []} {
            # puts "no more expansion for '$pat' add to results"
            lappend exp_l $pat
        } else {
            # this one is finished.
            foreach elem $cross_l {
                set npat [string replace $pat $sub_b $sub_e $elem]
                # puts "pat $pat | $elem => $npat"
                lappend pat_l $npat
            }
        }
    }
    # fix any escaped + [ ] { } that might be found 
    # lmap is better, but for those on 8.5, a quick + dirty hack.
    set esc_map [list "\\+" "+" "\\\{" "\{" "\\\}" "\}" "\\\[" "\[" "\\\]" "\]"]
    set exp_l [string map $esc_map $exp_l]

    # puts "result patterns: \n[join $exp_l "\n"]"
    return $exp_l
}

Limitations

If you use this from a unix shell, note that shells often do glob expansion using [ ] and { }. So, if these patterns are to be given as command line input, users would need to quote the patterns to avoid the shell expanding them, e.g.

 bash> MyProgram -length '[8-10]{a,b,c}'

Also, if you were to write the patterns as literals in TCL, you'd want to enclose them with { and }, so that [ and ] are not expanded by TCL. Maybe a different syntax would be better.

You can certainly do a lot better if you attempted full parsing of the patterns. My code has problems with nested patterns. But nested patterns could be used (for example) so that '+' is not needed, eg.

 -version {[1-2]a,3{a,b,c}}

However, I didn't take the time to count for balanced curly braces. If anyone would like to implement it in a reasonably parsimonious way, feel free!

There are a lot of bugs and limitations related to nesting.


arjen - 2021-06-07 14:15:10

I have not looked in detail at the interface questions you raised, but as the number of combinations may be large, something you could try and add is a version that returns the combinations one by one. If you look up the various pages that deal with combinatorial procedures, you will find some examples.


JAL - 2021-06-07 20:10:56

Hi arjen, thanks a lot for the feedback! Changing it to have an incremental interface would be great, since in my use so far I do need to produce millions of combinations (consuming some memory), though I use it more in a batch mode so time to produce the first element isn't too critical. My first thought was that with TCL 8.6 coroutines, this would seem to be easy: just yield each element one by one in a main loop, as elements are being produced. Thinking about this a little deeper, though, However, this would require "depth first" production, where my simplistic code at the moment just expands one level at a time. So, in a redesign, I can imagine assigning each sub pattern (range or set) to a number field, and produce an iteration over integers that can be mapped to a particular selection of element from each sub pattern. E.g., if the pattern was {A,B,C,D}_{0,1,2,3}, then you can map integers (here presented in binary) as:

   00,00 -> A_0
   00,01 -> A_1
   00,10 -> A_2
   ..
   11,11 -> D_3

That would also give an easy path to solve another limitation, which is to be able to control which pattern to iterate over first, which at the moment just goes left to right over patterns (though + takes precendence over [ ]). I could just parse the pattern once, assign each sub-pattern to a sub-field, and give an option for left-to-right vs. right-to-left (etc.)