** Loop behaviour **

Playing with the bytecode stepper in [Visual TAL], I was a bit surprised to see the bytecode for [foreach] invoking [llength] at every iteration. This made me wonder about the performance of common looping constructs, and turned up somewhat surprising results. As with any microbenchmark, this is probably measuring the wrong thing.

*** Specification ***

The obvious question is, "how does `foreach` behave if you modify the list under it?". The answer is conveniently omitted from the `foreach` manual, but provided clearly enough by the dodekalogue: `foreach` has its list arguments substituted before it is called (in other words: passed by value), so it cannot be affected by mutations.

*** Setup ***

Making big lists is easy with `split [string repeat ...`, but I have this in my prelude to start with:

======
proc range {from to} {
    if {$to<$from} {
        foreach {from to} [list $to $from] break   ;# swap, so we always count upwards
    }
    set res {}
    while {$from<=$to} {lappend res $from; incr from}
    set res
}
======

And our test variable. I initialise it this way because, when pasting into an interactive shell, you don't really want the value of `x` echoed back at you:

======
set x [range 0 1e6]; llength $x
======

(side note: will such a list always have an odd number of elements? I don't know.)

Here are the procs we want to test:

======
proc f {x} {
    foreach i $x {
        incr j
    }
    set j
}
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        incr j
    }
    set j
}
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        incr j
    }
    set j
}
======

*** (micro)Benchmarking ***

It turns out that iterating over a million-and-one element list is a bit slow on my netbook, so the initial timings are taken over a small sample of iterations:

======
tclsh-8.6.0 % time {f $x} 10
805257.6 microseconds per iteration
tclsh-8.6.0 % time {g $x} 10
502229.6 microseconds per iteration
tclsh-8.6.0 % time {h $x} 10
285997.6 microseconds per iteration
======

Striking. Of course this benchmark isn't exactly fair, since [foreach] is accessing the contents of the list as well as iterating over it. Suitably handicapping `g` and `h` is easy:

======
rename g g_
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        set i [lindex $x $a]
        incr j
    }
    set j
}
rename h h_
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        set i [lindex $x $a]
        incr j
    }
    set j
}
======

This brings the results closer to [foreach], but the same relative slowdown is apparent:

======
tclsh-8.6.0 % time {g $x} 10
688677.9 microseconds per iteration
tclsh-8.6.0 % time {h $x} 10
460473.2 microseconds per iteration
======

Curiously, `foreach` is still the slowest version. I hope this is an artifact of the system I am testing on.

*** Disassembly ***

Using [disassemble] on `g`, `h` and their predecessors reveals nothing unexpected. Looking at `f` though:

======
Command 1: "foreach i $x {incr j}"
  (0) loadScalar1 %v0       # var "x"
  (2) storeScalar1 %v1      # temp var 1
  (4) pop
  (5) foreach_start4 0 [data=[%v1], loop=%v2 it%v1 [%v3]]
  (10) foreach_step4 0 [data=[%v1], loop=%v2 it%v1 [%v3]]
  (15) jumpFalse1 +17       # pc 32
Command 2: "incr j"
  (17) startCommand +12 1   # next cmd at pc 29
  (26) incrScalar1Imm %v4 +1    # var "j"
  (29) pop
  (30) jump1 -20            # pc 10
  (32) push1 0              # ""
  (34) pop
Command 3: "set j"
  (35) startCommand +11 1   # next cmd at pc 46
  (44) loadScalar1 %v4      # var "j"
  (46) done
======

we see bytecode entries for `foreach_start4` and `foreach_step4`. It turns out my initial data was misleading: [dis2asm] unrolls these instructions into code similar to what we see in `g`'s loop.
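(For reference, listings like the one above can be reproduced with the unsupported disassembler that also appears further down this page; this is just the obvious invocation for a Tcl 8.6 shell:)

======
# dump the bytecode compiled for proc f
puts [tcl::unsupported::disassemble proc f]
======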
Indeed these are real bytecode instructions: in `tclExecute.c` you will find about 100 lines of C at the jump label `INST_FOREACH_STEP4`, confirming that it does take the length of every list at every iteration.

*** Ruminations ***

I don't really know what conclusions to draw from this, but it raises some questions. Is this measurement meaningful? If so, why is `foreach` measurably slower than a loop which apparently does the same thing? (I am reminded of the 386's `loop` opcode, which was significantly slower than the unrolled combination `dec cx; jnz`, leading to its disuse outside of 4k demos.) Is this an indication that rather than more bytecode instructions, what is really needed to boost Tcl performance is commands to support script-level interaction with the compiler? ... I'm dreaming about Lisp macros and stuff like [Sugar].

----

[RS] 2013-12-04 Thanks for trying [Visual TAL]! As I wrote in [dis2asm does macros], I could not find [TAL] instructions corresponding to foreach_start and foreach_step in tcl::unsupported::[assemble], so the "macro expansion" I concocted was a workaround. In Tcl [list] objects, the length is a data field (just as strings know their [string length]), so it does not cost much time to retrieve it. My first attempt kept the length in a separate variable, but I then decided that was not needed.

[DKF]: Adding the `foreach*` instructions is one of the TODOs, but they've got some weird side-conditions on them that would need modelling. (Also, TAL can't currently construct auxiliary data blobs, which the foreach instructions use.) The real complexity of the foreach stuff is there to handle the multi-variable, multi-list cases.

[RS] True... [dis2asm] already fails horribly on the simple variation

======
proc f x {set res {};foreach {i j} $x {if {$i>0} {lappend res $i}};set res}
======

[DKF]: Of course, this also means that it might be worthwhile for [foreach] to be optimised in the one-var-one-list case by not using the `foreach*` instructions. I shall need to investigate this further.

[RS] See at the end of [dis2asm does macros] how I manually expanded the ''foreach'' macros for the multi-variable and multi-list cases - in hindsight, the code to be generated is pretty regular.

[DKF]: I think I'd definitely focus on the one/one case; it's massively more common than the others, so it's worth putting special effort into.

[RS]: I agree that the one/one case is the most frequent; but the complexity of the code to generate grows almost linearly with the number of iterators:

   * foreach_start must expand into 3 instructions per iterator
   * foreach_step must expand into 9 instructions per iterator...
   * ... and finally tack on one ''lor'' instruction for each iterator except the first

With a cleverer macro expander (which parses the extra argument lines directly before generating the code), this will be an easy job - I will do it later today :^D

[MS] After reading this I finally decided to revisit the foreach bytecodes - see http://core.tcl.tk/tcl/info/3782af5f5a.
The proposed bytecodes for `f` now look like:

======
% tcl::unsupported::disassemble proc f
ByteCode 0x0x14416f0, refCt 1, epoch 15, interp 0x0x13f35d0 (epoch 15)
  Source "\n foreach i $x {\n incr j\n }\n set j\n"
  Cmds 3, src 51, inst 19, litObjs 1, aux 1, stkDepth 4, code/src 0.00
  Proc 0x0x143c7c0, refCt 1, args 1, compiled locals 3
      slot 0, scalar, arg, "x"
      slot 1, scalar, "i"
      slot 2, scalar, "j"
  Exception ranges 1, depth 1:
      0: level 0, loop, pc 7-9, continue 11, break 12
  Commands 3:
      1: pc 0-15, src 5-39
      2: pc 7-9, src 28-33
      3: pc 16-17, src 45-49
  Command 1: "foreach i $x {\n incr j\n }"
    (0) loadScalar1 %v0     # var "x"
    (2) foreach_start 0 [data=[%v0], loop=%v18446744073709551612 it%v0 [%v1]]
  Command 2: "incr j"
    (7) incrScalar1Imm %v2 +1   # var "j"
    (10) pop
    (11) foreach_step
    (12) foreach_end
    (13) nop
    (14) nop
    (15) nop
  Command 3: "set j"
    (16) loadScalar1 %v2    # var "j"
    (18) done
======
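(A quick way to repeat the comparison from earlier on a build containing that change is a sketch along these lines; the absolute timings will of course differ between builds and machines:)

======
# re-time the three variants defined above; 10 iterations each, as before
set x [range 0 1e6]
foreach p {f g h} {
    puts "$p: [time {$p $x} 10]"
}
======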