Tcl invoke performance

Summary

See also Tcl IO performance, where we reached the

Interim conclusion The time spent in the original benchmark seems to be decomposable in

8% in startup and loop overhead (0.19/2.5)
20% in command call and setting a variable in the called command (0.50/2.50) Further experimentation with a C-coded command that just returns TCL_OK shows that the cost of setting the variable is almost negligible
72% doing the file access, reading and conversion within gets itself
the cost of setting the variable in the called command seems to be below the measurable threshold for the larger dataset (read1a actually faster than read2 - the difference must be noise)

Full Results, v1

Further experiments

on a core instrumented with code developed together with GPS [L1 ]
running a tiny do-almost-nothing script (see below)
using a command empty defined in C to just return Tcl_Ok;

provide the following timings for the most-exercised opcodes (measured in cpu ticks at 1.6MHz):

8.5a6 time: 1822167696, count: 20004698
----------------------------------------------------
op         %T       avgT         %ops         Nops
6       24.53     446.87         5.00      1000323      INST_INVOKE_STK1
80      22.68     413.32         5.00      1000019      INST_LIST_INDEX
29      10.04      91.47        10.00      2000048      INST_INCR_SCALAR1_IMM
105      9.77      44.52        20.00      4000452      INST_START_CMD
103      7.58     138.16         5.00      1000004      INST_LIST_INDEX_IMM
10       7.20      26.22        25.00      5000750      INST_LOAD_SCALAR1
17       5.78      52.67        10.00      2000236      INST_STORE_SCALAR1
47       5.23      95.32         5.00      1000001      INST_LT
1        4.19      38.11        10.00      2001039      INST_PUSH1
3        2.95      53.83         5.00      1000302      INST_POP

Note: later measurements were done on 5M instead of 1M runs. If you want to compare the total runtime, this first batch should be considered to take 9110838480 clicks (1822167696*5)

Striking observations

(These require an explanation)

Taking the fastest opcode as comparison basis:

INST_LOAD_SCALAR1 is the fastest opcode (faster that INST_PUSH1 and INST_POP!), INST_STORE_SCALAR1 is pretty fast too
lindex is amazingly slow - the non-immediate version is as slow as a command invocation
command invocations are expensive (remark that only empty is invoked in the loop)
comparisons (INST_LT) and basic arithmetic (INST_INCR_SCALAR_IMM1) are amazingly slow when compared to basic variable access
the "pure loss" INST_START_CMD is amazingly slow

(do note that INST_LOAD_SCALAR1 provides a generous upper bound on the cost of opcode dispatch - and the possible savings when improving that part only)

Test Script

The script being run is

lappend auto_path /home/CVS/emptyFunc/
package require empty

exec /usr/bin/taskset -p 0x00000001 [pid]

proc main N {
    set y 0
    set a [list foo boo moo]
    for {set i 0} {$i < $N} {incr i} {
	empty 1
	incr y
	set z [lindex $a 1]
	set z 1
	lindex $a $z
    }
}

if {[llength $argv]} {
    main [lindex $argv 0]
} else {
    main 1000000
}

Full Results, v2

2007-11-09 Committed patch #1829248 [L2 ] that caches some frequently accessed TSD fields in the ekeko. INST_INVOKE_STK1 goes down from 503 ticks to 306 (same measurements as above, did it get slower in the meantime??). New timings:

8.5a6 time: 8187552172, count: 100004702
----------------------------------------------------
op         %T       avgT         %ops         Nops
80      25.07     410.45         5.00      5000019      INST_LIST_INDEX
6       18.63     305.00         5.00      5000323      INST_INVOKE_STK1
29      11.37      93.12        10.00     10000048      INST_INCR_SCALAR1_IMM
105      9.67      39.58        20.00     20000453      INST_START_CMD
10       8.20      26.85        25.00     25000750      INST_LOAD_SCALAR1
47       7.35     120.43         5.00      5000001      INST_LT
17       6.07      49.74        10.00     10000236      INST_STORE_SCALAR1
103      5.26      86.07         5.00      5000005      INST_LIST_INDEX_IMM
1        5.11      41.86        10.00     10001040      INST_PUSH1
3        3.26      53.38         5.00      5000302      INST_POP

Full Results, v3

2007-11-12 A few commits after that: new stack check approach, lindex changes in tebc, improved usage of objType to prevent unneeded C calls, shortcut path for reading ints and indices.

Note: first posting here was better, but a wrong measurement. Sorry.

8.5a6 time: 6083734760, count: 100004702
----------------------------------------------------
op         %T       avgT         %ops         Nops
6       20.32     247.27         5.00      5000321      INST_INVOKE_STK1
29      14.53      88.42        10.00     10000048      INST_INCR_SCALAR1_IMM
105     13.34      40.59        20.00     20000453      INST_START_CMD
10      11.54      28.08        25.00     25000750      INST_LOAD_SCALAR1
17       9.41      57.26        10.00     10000236      INST_STORE_SCALAR1
47       7.75      94.26         5.00      5000001      INST_LT
103      6.41      77.97         5.00      5000005      INST_LIST_INDEX_IMM
80       6.26      76.20         5.00      5000019      INST_LIST_INDEX
1        5.90      35.90        10.00     10001040      INST_PUSH1
3        4.51      54.84         5.00      5000302      INST_POP

By selective removing or duplicating code I estimate that, out of the 250 clicks of INST_INVOKE_STK1 (TEOvI), about 35 each are consumed by Tcl_GetCommandFromObj and Tcl_ResetResult. The trace code (surprisingly?) seems not to make any measurable difference.

Note that commandReform (reducing the cost of Tcl_GetCommandFromObj) has at best a hope of shaving 20 clicks, less than 10% of the total.

Tcl_ResetResult could conceivably be eliminated altogether, but this entails a change in the API: commands are not guaranteed to receive a clean objResult, errorCode, errorInfo or other return options. As they would have to reset the result themselves (when calling Tcl_SetObjResult) , most of the apparent win is not really there: we'd just be saving a Tcl_NewObj/Tcl_IncrRefCount/Tcl_DecrRefCount stanza. This one could conceivably be saved by allowing iPtr->objResultPtr to be NULL (and Tcl_GetObjResult fill it in if necessary), but that will break the deprecated but still used idiom of appending to the result Tcl_Obj. Small win, big pain - a no-go.

Category Internals	Category Performance