Version 3 of Tcl invoke performance

See also Tcl IO performance, where we reached the following

Interim conclusion: the time spent in the original benchmark seems to decompose into

8% in startup and loop overhead (0.19/2.50)
20% in the command call and setting a variable in the called command (0.50/2.50). Further experimentation with a C-coded command that just returns TCL_OK shows that the cost of setting the variable is almost negligible.
72% in file access, reading and conversion within gets itself
The cost of setting the variable in the called command seems to be below the measurable threshold for the larger dataset (read1a is actually faster than read2; the difference must be noise).
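
The percentages above can be rechecked with a few lines of Tcl. This is only a sketch: the 0.19 s and 0.50 s figures and the 2.50 s total are taken from the list, while the time attributed to gets is assumed here to be simply the remainder.

```tcl
# Timings in seconds, as quoted in the list above.
set total   2.50   ;# whole benchmark
set startup 0.19   ;# startup and loop overhead
set call    0.50   ;# command call plus variable set

# Assumption: whatever is left over is spent inside gets.
set gets [expr {$total - $startup - $call}]

foreach {label t} [list startup $startup call $call gets $gets] {
    set pct [expr {100.0 * $t / $total}]
    puts [format "%-8s %5.2f s  %3.0f%%" $label $t $pct]
}
```

The three shares round to 8%, 20% and 72%.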

Further experiments

on a core instrumented with code developed together with GPS
running a tiny do-almost-nothing script (see below)
using a command empty, defined in C, that just returns TCL_OK

provide the following timings for the most-exercised opcodes (measured in CPU ticks at 1.6 GHz):

8.5a6 time: 1822167696, count: 20004698
----------------------------------------------------
op         %T       avgT         %ops         Nops
6       24.53     446.87         5.00      1000323      INST_INVOKE_STK1
80      22.68     413.32         5.00      1000019      INST_LIST_INDEX
29      10.04      91.47        10.00      2000048      INST_INCR_SCALAR1_IMM
105      9.77      44.52        20.00      4000452      INST_START_CMD
103      7.58     138.16         5.00      1000004      INST_LIST_INDEX_IMM
10       7.20      26.22        25.00      5000750      INST_LOAD_SCALAR1
17       5.78      52.67        10.00      2000236      INST_STORE_SCALAR1
47       5.23      95.32         5.00      1000001      INST_LT
1        4.19      38.11        10.00      2001039      INST_PUSH1
3        2.95      53.83         5.00      1000302      INST_POP
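
The %T column follows from the other columns: it is avgT times Nops, divided by the total tick count in the header line. A short sketch that recomputes it for a few rows (the helper name pctOfTotal is made up for this illustration):

```tcl
# Share of total time for one opcode, in percent.
# pctOfTotal is a hypothetical helper, not part of the instrumented core.
proc pctOfTotal {avgT nops total} {
    expr {100.0 * $avgT * $nops / $total}
}

set total 1822167696   ;# "time:" value from the table header

# opcode, avgT, Nops: copied from the table
foreach {op avgT nops} {
    INST_INVOKE_STK1   446.87 1000323
    INST_LIST_INDEX    413.32 1000019
    INST_LOAD_SCALAR1   26.22 5000750
} {
    puts [format "%-20s %6.2f%%" $op [pctOfTotal $avgT $nops $total]]
}
```

The results should agree with the %T column to within rounding.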

Striking observations (which require an explanation)

Taking the fastest opcode as comparison basis:

INST_LOAD_SCALAR1 is the fastest opcode (faster than INST_PUSH1 and INST_POP!), INST_STORE_SCALAR1 is pretty fast too
lindex is amazingly slow - the non-immediate version is as slow as a command invocation
command invocations are expensive (remark that only empty is invoked in the loop)
comparisons (INST_LT) and basic arithmetic (INST_INCR_SCALAR1_IMM) are amazingly slow when compared to basic variable access
the "pure loss" INST_START_CMD is amazingly slow

(do note that INST_LOAD_SCALAR1 provides a generous upper bound on the cost of opcode dispatch - and the possible savings when improving that part only)
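
To put numbers on these observations, the sketch below expresses each avgT as a multiple of INST_LOAD_SCALAR1, and then computes the dispatch bound by charging every executed opcode the full cost of INST_LOAD_SCALAR1. All figures are copied from the table; treating the whole of INST_LOAD_SCALAR1 as dispatch cost is the deliberately generous assumption.

```tcl
# Average ticks per opcode, copied from the table.
set avgT {
    INST_INVOKE_STK1      446.87
    INST_LIST_INDEX       413.32
    INST_LIST_INDEX_IMM   138.16
    INST_LT                95.32
    INST_INCR_SCALAR1_IMM  91.47
    INST_START_CMD         44.52
    INST_LOAD_SCALAR1      26.22
}
set base [dict get $avgT INST_LOAD_SCALAR1]

# Cost of each opcode relative to the fastest one.
dict for {op t} $avgT {
    puts [format "%-22s %5.1fx" $op [expr {$t / $base}]]
}

# Upper bound on dispatch cost: every opcode pays $base ticks.
set totalTicks 1822167696
set totalOps   20004698
set bound [expr {100.0 * $base * $totalOps / $totalTicks}]
puts [format "dispatch upper bound: %.1f%% of total time" $bound]
```

On these figures a command invocation costs about 17 times a variable load, and improving opcode dispatch alone could save at most about 29% of the run time.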

The script being run is

lappend auto_path /home/CVS/emptyFunc/
package require empty

exec /usr/bin/taskset -p 0x00000001 [pid]

proc main N {
    set y 0
    set a [list foo boo moo]
    for {set i 0} {$i < $N} {incr i} {
	empty 1
	incr y
	set z [lindex $a 1]
	set z 1
	lindex $a $z
    }
}

if {[llength $argv]} {
    main [lindex $argv 0]
} else {
    main 1000000
}
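
To see which opcodes the loop body compiles to on an uninstrumented interpreter, the proc can be disassembled. This is a sketch assuming Tcl 8.5 or later, where tcl::unsupported::disassemble is available; as the namespace says, it is not a supported interface, and the mnemonics and layout of its output may differ between versions. The empty extension does not need to be loaded for this, since a command that is not known at compile time still compiles to a generic invoke.

```tcl
# Same proc as in the script above; empty is never called here,
# the body is only compiled so that it can be listed.
proc main N {
    set y 0
    set a [list foo boo moo]
    for {set i 0} {$i < $N} {incr i} {
        empty 1
        incr y
        set z [lindex $a 1]
        set z 1
        lindex $a $z
    }
}

# Compile the proc body and capture its bytecode listing.
set listing [tcl::unsupported::disassemble proc main]
puts $listing
```

The listing can then be compared against the opcode names in the timing table.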