[Richard Suchenwirth] 2013-11-30 - Another chapter in the [dis2asm] saga: The Tcl compiler converts [proc] bodies into [bytecode]. With tcl::unsupported::'''disassemble''' we can inspect the generated code in "dis" [assembler] notation. With tcl::unsupported::'''assemble''' we can convert a (somewhat different) assembler notation "[TAL]" to bytecode again. The job of [dis2asm] is to convert a string in "dis" to another string in TAL so the original proc works equally well.

Looking at the [dis2asm] output, we sometimes notice that it produces code that could be better - in other words, "optimized": doing the same job in less bytecodes, and hence (at least marginally) less time. Examples:
   push {}
   pop
is an absolutely redundant piece of code: first something is pushed on the stack, and then immediately popped off again. No effect, but 3 bytes in bytecode, and just a little more time needed to run. Or, the second and third line of
   jump Lxx
 label Lyy;
   jump Lzz
are also redundant: because the code above jumps away, Lyy can only be reached by code that explicitly jumps there - only to be redirected to Lzz again. It might as well have directly jumped to Lzz.

The code shown on this page deals with such issues. It is a postprocessor which converts [dis2asm] output to another string in the same TAL language, but optimized where possible. To test it, I have extended the ''aproc'' wrapper to accept an -o flag and if present, to run the optimizer on the TAL output:
======
proc aproc {name argl body args} {
    proc $name $argl $body
    set res [disasm proc $name]
    if {"-x" in $args} {
        set res [list proc $name $argl [list asm [dis2asm $res]]]
        if {"-o" in $args} {set res [optimize $res]}
        eval $res
    }
    return $res
}
======
The '''optimize''' proc is a little longer than that, even though it currently just handles the few cases discussed above (and below). It splits the TAL input into a list of lines and iterates over them with [for], so it can also operate on other than the current line. Lines considered redundant are first marked with the prefix "#o", and removed after one pass, so that indexes don't get confused.

In testing conditions, the previous and current instructions are put together in a string - the "peephole". I think e.g. "push pop" is quite self-documenting for a peephole condition... :^)
======
proc optimize tal {
    set last  ""
    set last2 ""
    set lines [split $tal \n]
    for {set i 0} {$i < [llength $lines]} {incr i} {
        set instr [regexp -inline {[A-Za-z0-9_]+} [lindex $lines $i]]
        if {"$last $instr" eq "push pop"} {
            lset lines $i-1 #o[lindex $lines $i-1] ;# mark for deletion
            lset lines $i   #o[lindex $lines $i]
        } elseif {"$last $instr" eq "jump jump"} { ;# unreachable jump
            lset lines $i   #o[lindex $lines $i]
        }
        set last $instr
    }
    while 1 {
        set tmp {} ;# remove marked lines
        foreach line $lines {if ![string match #o* $line] {lappend tmp $line}}
        set lines $tmp
        set found 0
        for {set i 0} {$i < [llength $lines]} {incr i} {
            set instr [regexp -inline {[A-Za-z0-9_]+} [lindex $lines $i]]
            if {"$last2 $last $instr" eq "jump label jump"} {
                set oldTrg [string trimright [lindex $lines $i-1 1] ";"]
                set newTrg [lindex $lines $i   1]
                lset lines $i-1 #o[lindex $lines $i-1] ;# mark for deletion
                lset lines $i   #o[lindex $lines $i]
                set found 1
                break
            }
            set last2 $last
            set last  $instr
        }
        if $found {
            set tmp {} ;# remove marked lines
            foreach line $lines {
                if [regexp "jump.* $oldTrg " $line] {
                    set line [string map [list " $oldTrg " " $newTrg "] $line]
                }
                if ![string match #o* $line] {lappend tmp $line}
            }
            set lines $tmp
        } else break
    }
    return [join $lines \n]
}
======
Testing: first unoptimized TAL...
======
% aproc f x {foreach i {a b} {foreach j $x {puts $i,$j}}} -x 
proc f x {asm {
   push {a b}               ;# (0) push1 0         # "a b"
   store 1                  ;# (2) storeScalar1 %v1         # temp var 1
   pop                      ;# (4) pop
  push -1; store 2; pop    ;# (5) foreach_start4 0
 label L10;
   incrImm 2 +1;load 1;load 2
   listIndex;store i;pop
   load 1;listLength;lt     ;# (10) foreach_step4 0
   jumpFalse L63            ;# (15) jumpFalse1 +48         # pc 63
                            ;# (17) startCommand +43 1         # next cmd at pc 60
   load x                   ;# (26) loadScalar1 %v0         # var "x"
   store 4                  ;# (28) storeScalar1 %v4         # temp var 4
   pop                      ;# (30) pop
  push -1; store 5; pop    ;# (31) foreach_start4 1
 label L36;
   incrImm 5 +1;load 4;load 5
   listIndex;store j;pop
   load 4;listLength;lt     ;# (36) foreach_step4 1
   jumpFalse L58            ;# (41) jumpFalse1 +17         # pc 58
   push puts                ;# (43) push1 1         # "puts"
   load i                   ;# (45) loadScalar1 %v3         # var "i"
   push ,                   ;# (47) push1 2         # ","
   load j                   ;# (49) loadScalar1 %v6         # var "j"
   concat 3                 ;# (51) concat1 3
   invokeStk 2              ;# (53) invokeStk1 2
   pop                      ;# (55) pop
   jump L36                 ;# (56) jump1 -20         # pc 36
 label L58;
   push {}                  ;# (58) push1 3         # ""
   pop                      ;# (60) pop
   jump L10                 ;# (61) jump1 -51         # pc 10
 label L63;
   push {}                  ;# (63) push1 3         # ""
                            ;# (65) done
 label Done;
}}
======
and now, with the -o switch added, the optimized version:
======
% aproc f x {foreach i {a b} {foreach j $x {puts $i,$j}}} -x -o
proc f x {asm {
   push {a b}               ;# (0) push1 0         # "a b"
   store 1                  ;# (2) storeScalar1 %v1         # temp var 1
   pop                      ;# (4) pop
  push -1; store 2; pop    ;# (5) foreach_start4 0
 label L10;
   incrImm 2 +1;load 1;load 2
   listIndex;store i;pop
   load 1;listLength;lt     ;# (10) foreach_step4 0
   jumpFalse L63            ;# (15) jumpFalse1 +48         # pc 63
                            ;# (17) startCommand +43 1         # next cmd at pc 60
   load x                   ;# (26) loadScalar1 %v0         # var "x"
   store 4                  ;# (28) storeScalar1 %v4         # temp var 4
   pop                      ;# (30) pop
  push -1; store 5; pop    ;# (31) foreach_start4 1
 label L36;
   incrImm 5 +1;load 4;load 5
   listIndex;store j;pop
   load 4;listLength;lt     ;# (36) foreach_step4 1
   jumpFalse L10            ;# (41) jumpFalse1 +17         # pc 58
   push puts                ;# (43) push1 1         # "puts"
   load i                   ;# (45) loadScalar1 %v3         # var "i"
   push ,                   ;# (47) push1 2         # ","
   load j                   ;# (49) loadScalar1 %v6         # var "j"
   concat 3                 ;# (51) concat1 3
   invokeStk 2              ;# (53) invokeStk1 2
   pop                      ;# (55) pop
   jump L36                 ;# (56) jump1 -20         # pc 36
 label L63;
   push {}                  ;# (63) push1 3         # ""
                            ;# (65) done
 label Done;
}}
======
Lines marked 58 to 61 are gone, line 41 now jumps directly to L10... but does it still work as before?
======
% f {0 1 2}
a,0
a,1
a,2
b,0
b,1
b,2
======
Another potential for optimization appears in the following test:
======
% aproc f x {foreach i $x {if {$i eq "b"} continue;puts $i}} -x
proc f x {asm {
   load x                   ;# (0) loadScalar1 %v0         # var "x"
   store 1                  ;# (2) storeScalar1 %v1         # temp var 1
   pop                      ;# (4) pop
   push -1; store 2; pop    ;# (5) foreach_start4 0
 label L10;
   incrImm 2 +1;load 1;load 2
   listIndex;store i;pop
   load 1;listLength;lt     ;# (10) foreach_step4 0
   jumpFalse L61            ;# (15) jumpFalse1 +46         # pc 61
                            ;# (17) startCommand +34 1         # next cmd at pc 51
   load i                   ;# (26) loadScalar1 %v3         # var "i"
   push b                   ;# (28) push1 0         # "b"
   streq                    ;# (30) streq
   jumpFalse L49            ;# (31) jumpFalse1 +18         # pc 49
                            ;# (33) startCommand +14 1         # next cmd at pc 47
   jump L10                 ;# (42) jump4 -32         # pc 10
   jump L51                 ;# (47) jump1 +4         # pc 51
 label L49;
   push {}                  ;# (49) push1 1         # ""
 label L51;
   pop                      ;# (51) pop
   push puts                ;# (52) push1 2         # "puts"
   load i                   ;# (54) loadScalar1 %v3         # var "i"
   invokeStk 2              ;# (56) invokeStk1 2
   pop                      ;# (58) pop
   jump L10                 ;# (59) jump1 -49         # pc 10
 label L61;
   push {}                  ;# (61) push1 1         # ""
                            ;# (63) done
 label Done;
}}
======
Quite a mouthful, from a one-liner [proc]... Notice the lines marked (42) and (47), the latter of which is dead code that cannot be reached. I have added code to handle that situation in the ''optimize'' proc above. Another two bytes saved...

<<categories>>Example