Loop microbenchmark

Loop behaviour

Playing with the bytecode stepper in Visual TAL, I was a bit surprised to see the bytecode for foreach invoking llength at every iteration. This made me wonder about the performance of common looping constructs, and the results were somewhat surprising.

As with any microbenchmark, this is probably measuring the wrong thing.

Specification

The obvious question is, "how does foreach behave if you modify the list under it?" The answer is conveniently omitted from the foreach manual, but provided clearly enough by the dodekalogue: foreach has its list arguments substituted before it is called (in other words, passed by value), so it cannot be affected by mutations.
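
A minimal illustration of this: mutating the variable inside the loop has no effect on the iteration, since foreach already holds its own reference to the list value it was handed.

set xs {1 2 3}
foreach i $xs {
    lappend xs [expr {$i*10}]   ;# modifies the variable, not the value foreach iterates over
    puts $i                     ;# prints 1, 2, 3 and then stops
}
puts $xs                        ;# the variable itself did change: 1 2 3 10 20 30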

Setup

Making big lists is easy with split [string repeat ..., but I have this in my prelude to start with:

proc range {from to} {
    if {$to<$from} {
        # swap so that from <= to
        foreach {from to} [list $to $from] break
    }
    set res {}
    while {$from<=$to} {lappend res $from; incr from}
    set res
}
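
A quick check of the helper in an interactive shell (output shown for illustration; the swapped-bounds case relies on the foreach ... break swap above):

  tclsh-8.6.0 % range 3 7
  3 4 5 6 7
  tclsh-8.6.0 % range 7 3
  3 4 5 6 7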

And our test variable. I initialise it this way because, when pasting into the shell, you don't really want the whole value of x echoed back at you; the llength result is a much shorter acknowledgement.

set x [range 0 1e6]; llength $x

(side note: will such a list always have an odd number of elements? I don't know.)

Here are the procs we want to test:

# f: plain foreach over the list
proc f {x} {
    foreach i $x {
        incr j
    }
    set j
}
# g: for loop, calling llength on every iteration
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        incr j
    }
    set j
}
# h: for loop, with the length cached in a local variable
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        incr j
    }
    set j
}
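
A quick sanity check: each proc simply counts elements, so all three should agree with llength on the x built above, and this expression should return 1:

expr {[f $x] == [g $x] && [g $x] == [h $x] && [h $x] == [llength $x]}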

(micro)Benchmarking

It turns out that iterating over a million-and-one element list is a bit slow on my netbook, so the initial timing results are from a small sample of just 10 iterations each:

  tclsh-8.6.0 % time {f $x} 10
  805257.6 microseconds per iteration
  tclsh-8.6.0 % time {g $x} 10
  502229.6 microseconds per iteration
  tclsh-8.6.0 % time {h $x} 10
  285997.6 microseconds per iteration

Striking. Of course this benchmark isn't exactly fair, since foreach is accessing the contents of the list as well as iterating over it. Suitably handicapping g and h is easy:

rename g g_
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        set i [lindex $x $a]    ;# fetch the element, as foreach does
        incr j
    }
    set j
}
rename h h_
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        set i [lindex $x $a]    ;# fetch the element, as foreach does
        incr j
    }
    set j
}

This brings the results closer to foreach, but the same performance degradation is apparent:

  tclsh-8.6.0 % time {g $x} 10
  688677.9 microseconds per iteration
  tclsh-8.6.0 % time {h $x} 10
  460473.2 microseconds per iteration

Curiously, foreach is still the slowest version. I hope this is an artifact of the system I am testing on.

Disassembly

Using ::tcl::unsupported::disassemble on g, h and their predecessors reveals nothing unexpected. Looking at f, though:

  Command 1: "foreach i $x {incr j}"
    (0) loadScalar1 %v0         # var "x"
    (2) storeScalar1 %v1        # temp var 1
    (4) pop 
    (5) foreach_start4 0 
                [data=[%v1], loop=%v2
                 it%v1  [%v3]]
    (10) foreach_step4 0 
                [data=[%v1], loop=%v2
                 it%v1  [%v3]]
    (15) jumpFalse1 +17         # pc 32
  Command 2: "incr j"
    (17) startCommand +12 1     # next cmd at pc 29
    (26) incrScalar1Imm %v4 +1  # var "j"
    (29) pop 
    (30) jump1 -20      # pc 10
    (32) push1 0        # ""
    (34) pop 
  Command 3: "set j"
    (35) startCommand +11 1     # next cmd at pc 46
    (44) loadScalar1 %v4        # var "j"
    (46) done 

we see bytecode entries for foreach_start4 and foreach_step4. It turns out my initial data was wrong: dis2asm unrolls these instructions into code similar to what we see in g's loop. Indeed these are real bytecodes: in tclExecute.c you will find about 100 lines of C at the jump label INST_FOREACH_STEP4, confirming that it does take the length of every list at every iteration.
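
For reference, the instruction listing above can be reproduced (modulo a few header lines) from a stock 8.6 shell, where the disassembler lives alongside assemble in the unsupported namespace:

::tcl::unsupported::disassemble proc f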

Ruminations

I don't really know what conclusions to draw from this, but it raises some questions. Is this measurement meaningful? If so, why is foreach measurably slower than a loop which apparently does the same thing? (I am reminded of the 386's loop opcode, which was significantly slower than the unrolled combination dec cx; jnz, leading to its disuse outside of 4k demos.) Is this an indication that, rather than more bytecode instructions, what is really needed to boost Tcl performance is commands to support script-level interaction with the compiler? (I'm dreaming about Lisp macros and things like Sugar.)


RS 2013-12-04 Thanks for trying Visual TAL! As I wrote in dis2asm does macros, I could not find TAL instructions corresponding to foreach_start and foreach_step in tcl::unsupported::assemble, so the "macro expansion" I concocted was a workaround.

In Tcl, list objects carry their length as a data field (just as string objects know their string length), so retrieving it does not cost much time. My first attempt had a separate variable for the length, but I then realised it was not needed.
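
For instance, llength on an already-built list should take roughly the same time whatever its size, because it just reads that stored count (a quick script-level check, reusing the range proc from above):

set small [range 0 10]
set big   [range 0 1e6]
puts [time {llength $small} 10000]
puts [time {llength $big} 10000]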

DKF: Adding the foreach* instructions is one of the TODOs, but they've got some weird side-conditions on them that would need modelling. (Also, TAL can't currently construct auxiliary data blobs, which the foreach instructions use.) The real complexity of the foreach stuff is there to handle the multi-variable multi-list cases.

RS True... dis2asm already fails horribly on the simple variation

 proc f x {set res {};foreach {i j} $x {if {$i>0} {lappend res $i}};set res}