Tcl invoke performance

Summary

See also Tcl IO performance, where we reached the

Interim conclusion The time spent in the original benchmark seems to be decomposable in

  • 8% in startup and loop overhead (0.19/2.5)
  • 20% in command call and setting a variable in the called command (0.50/2.50) Further experimentation with a C-coded command that just returns TCL_OK shows that the cost of setting the variable is almost negligible
  • 72% doing the file access, reading and conversion within gets itself
  • the cost of setting the variable in the called command seems to be below the measurable threshold for the larger dataset (read1a actually faster than read2 - the difference must be noise)

Full Results, v1

Further experiments

  • on a core instrumented with code developed together with GPS [L1 ]
  • running a tiny do-almost-nothing script (see below)
  • using a command empty defined in C to just return Tcl_Ok;

provide the following timings for the most-exercised opcodes (measured in cpu ticks at 1.6MHz):

8.5a6 time: 1822167696, count: 20004698
----------------------------------------------------
op         %T       avgT         %ops         Nops
6       24.53     446.87         5.00      1000323      INST_INVOKE_STK1
80      22.68     413.32         5.00      1000019      INST_LIST_INDEX
29      10.04      91.47        10.00      2000048      INST_INCR_SCALAR1_IMM
105      9.77      44.52        20.00      4000452      INST_START_CMD
103      7.58     138.16         5.00      1000004      INST_LIST_INDEX_IMM
10       7.20      26.22        25.00      5000750      INST_LOAD_SCALAR1
17       5.78      52.67        10.00      2000236      INST_STORE_SCALAR1
47       5.23      95.32         5.00      1000001      INST_LT
1        4.19      38.11        10.00      2001039      INST_PUSH1
3        2.95      53.83         5.00      1000302      INST_POP

Note: later measurements were done on 5M instead of 1M runs. If you want to compare the total runtime, this first batch should be considered to take 9110838480 clicks (1822167696*5)

Striking observations

(These require an explanation)

Taking the fastest opcode as comparison basis:

  1. INST_LOAD_SCALAR1 is the fastest opcode (faster that INST_PUSH1 and INST_POP!), INST_STORE_SCALAR1 is pretty fast too
  2. lindex is amazingly slow - the non-immediate version is as slow as a command invocation
  3. command invocations are expensive (remark that only empty is invoked in the loop)
  4. comparisons (INST_LT) and basic arithmetic (INST_INCR_SCALAR_IMM1) are amazingly slow when compared to basic variable access
  5. the "pure loss" INST_START_CMD is amazingly slow

(do note that INST_LOAD_SCALAR1 provides a generous upper bound on the cost of opcode dispatch - and the possible savings when improving that part only)

Test Script

The script being run is

lappend auto_path /home/CVS/emptyFunc/
package require empty

exec /usr/bin/taskset -p 0x00000001 [pid]

proc main N {
    set y 0
    set a [list foo boo moo]
    for {set i 0} {$i < $N} {incr i} {
	empty 1
	incr y
	set z [lindex $a 1]
	set z 1
	lindex $a $z
    }
}

if {[llength $argv]} {
    main [lindex $argv 0]
} else {
    main 1000000
}

Full Results, v2

2007-11-09 Committed patch #1829248 [L2 ] that caches some frequently accessed TSD fields in the ekeko. INST_INVOKE_STK1 goes down from 503 ticks to 306 (same measurements as above, did it get slower in the meantime??). New timings:

8.5a6 time: 8187552172, count: 100004702
----------------------------------------------------
op         %T       avgT         %ops         Nops
80      25.07     410.45         5.00      5000019      INST_LIST_INDEX
6       18.63     305.00         5.00      5000323      INST_INVOKE_STK1
29      11.37      93.12        10.00     10000048      INST_INCR_SCALAR1_IMM
105      9.67      39.58        20.00     20000453      INST_START_CMD
10       8.20      26.85        25.00     25000750      INST_LOAD_SCALAR1
47       7.35     120.43         5.00      5000001      INST_LT
17       6.07      49.74        10.00     10000236      INST_STORE_SCALAR1
103      5.26      86.07         5.00      5000005      INST_LIST_INDEX_IMM
1        5.11      41.86        10.00     10001040      INST_PUSH1
3        3.26      53.38         5.00      5000302      INST_POP

Full Results, v3

2007-11-12 A few commits after that: new stack check approach, lindex changes in tebc, improved usage of objType to prevent unneeded C calls, shortcut path for reading ints and indices.

Note: first posting here was better, but a wrong measurement. Sorry.

8.5a6 time: 6083734760, count: 100004702
----------------------------------------------------
op         %T       avgT         %ops         Nops
6       20.32     247.27         5.00      5000321      INST_INVOKE_STK1
29      14.53      88.42        10.00     10000048      INST_INCR_SCALAR1_IMM
105     13.34      40.59        20.00     20000453      INST_START_CMD
10      11.54      28.08        25.00     25000750      INST_LOAD_SCALAR1
17       9.41      57.26        10.00     10000236      INST_STORE_SCALAR1
47       7.75      94.26         5.00      5000001      INST_LT
103      6.41      77.97         5.00      5000005      INST_LIST_INDEX_IMM
80       6.26      76.20         5.00      5000019      INST_LIST_INDEX
1        5.90      35.90        10.00     10001040      INST_PUSH1
3        4.51      54.84         5.00      5000302      INST_POP

By selective removing or duplicating code I estimate that, out of the 250 clicks of INST_INVOKE_STK1 (TEOvI), about 35 each are consumed by Tcl_GetCommandFromObj and Tcl_ResetResult. The trace code (surprisingly?) seems not to make any measurable difference.

Note that commandReform (reducing the cost of Tcl_GetCommandFromObj) has at best a hope of shaving 20 clicks, less than 10% of the total.

Tcl_ResetResult could conceivably be eliminated altogether, but this entails a change in the API: commands are not guaranteed to receive a clean objResult, errorCode, errorInfo or other return options. As they would have to reset the result themselves (when calling Tcl_SetObjResult) , most of the apparent win is not really there: we'd just be saving a Tcl_NewObj/Tcl_IncrRefCount/Tcl_DecrRefCount stanza. This one could conceivably be saved by allowing iPtr->objResultPtr to be NULL (and Tcl_GetObjResult fill it in if necessary), but that will break the deprecated but still used idiom of appending to the result Tcl_Obj. Small win, big pain - a no-go.