**Summary** See also [Tcl IO performance], where we reached the '''Interim conclusion''' The time spent in the original benchmark seems to be decomposable in * 8% in startup and loop overhead (0.19/2.5) * 20% in command call and setting a variable in the called command (0.50/2.50) ''Further experimentation with a C-coded command that just returns TCL_OK shows that the cost of setting the variable is almost negligible'' * 72% doing the file access, reading and conversion within [gets] itself * the cost of setting the variable in the called command seems to be below the measurable threshold for the larger dataset (read1a actually faster than read2 - the difference ''must'' be noise) ---- **Full Results, v1** Further experiments * on a core instrumented with code developed together with [GPS] [https://sourceforge.net/tracker/index.php?func=detail&aid=1828178&group_id=10894&atid=310894] * running a tiny do-almost-nothing script (see below) * using a command '''empty''' defined in C to just ''return Tcl_Ok;'' provide the following timings for the most-exercised opcodes (measured in cpu ticks at 1.6MHz): ====== 8.5a6 time: 1822167696, count: 20004698 ---------------------------------------------------- op %T avgT %ops Nops 6 24.53 446.87 5.00 1000323 INST_INVOKE_STK1 80 22.68 413.32 5.00 1000019 INST_LIST_INDEX 29 10.04 91.47 10.00 2000048 INST_INCR_SCALAR1_IMM 105 9.77 44.52 20.00 4000452 INST_START_CMD 103 7.58 138.16 5.00 1000004 INST_LIST_INDEX_IMM 10 7.20 26.22 25.00 5000750 INST_LOAD_SCALAR1 17 5.78 52.67 10.00 2000236 INST_STORE_SCALAR1 47 5.23 95.32 5.00 1000001 INST_LT 1 4.19 38.11 10.00 2001039 INST_PUSH1 3 2.95 53.83 5.00 1000302 INST_POP ====== ''Note'': later measurements were done on 5M instead of 1M runs. If you want to compare the total runtime, this first batch should be considered to take 9110838480 clicks (1822167696*5) ***Striking observations*** (These require an explanation) Taking the fastest opcode as comparison basis: 1. INST_LOAD_SCALAR1 is the fastest opcode (faster that INST_PUSH1 and INST_POP!), INST_STORE_SCALAR1 is pretty fast too 1. [lindex] is amazingly slow - the non-immediate version is as slow as a command invocation 1. command invocations are expensive (remark that only '''empty''' is invoked in the loop) 1. comparisons (INST_LT) and basic arithmetic (INST_INCR_SCALAR_IMM1) are amazingly slow when compared to basic variable access 1. the "pure loss" INST_START_CMD is amazingly slow (do note that INST_LOAD_SCALAR1 provides a ''generous'' upper bound on the cost of opcode dispatch - and the possible savings when improving that part only) ***Test Script*** The script being run is ====== lappend auto_path /home/CVS/emptyFunc/ package require empty exec /usr/bin/taskset -p 0x00000001 [pid] proc main N { set y 0 set a [list foo boo moo] for {set i 0} {$i < $N} {incr i} { empty 1 incr y set z [lindex $a 1] set z 1 lindex $a $z } } if {[llength $argv]} { main [lindex $argv 0] } else { main 1000000 } ====== ---- **Full Results, v2** 2007-11-09 Committed patch #1829248 [https://sourceforge.net/tracker/index.php?func=detail&aid=1829248&group_id=10894&atid=310894] that caches some frequently accessed TSD fields in the [ekeko]. '''INST_INVOKE_STK1 goes down from 503 ticks to 306''' (same measurements as above, did it get slower in the meantime??). New timings: ====== 8.5a6 time: 8187552172, count: 100004702 ---------------------------------------------------- op %T avgT %ops Nops 80 25.07 410.45 5.00 5000019 INST_LIST_INDEX 6 18.63 305.00 5.00 5000323 INST_INVOKE_STK1 29 11.37 93.12 10.00 10000048 INST_INCR_SCALAR1_IMM 105 9.67 39.58 20.00 20000453 INST_START_CMD 10 8.20 26.85 25.00 25000750 INST_LOAD_SCALAR1 47 7.35 120.43 5.00 5000001 INST_LT 17 6.07 49.74 10.00 10000236 INST_STORE_SCALAR1 103 5.26 86.07 5.00 5000005 INST_LIST_INDEX_IMM 1 5.11 41.86 10.00 10001040 INST_PUSH1 3 3.26 53.38 5.00 5000302 INST_POP ====== ---- **Full Results, v3** 2007-11-12 A few commits after that: new stack check approach, [lindex] changes in tebc, improved usage of ''objType'' to prevent unneeded C calls, shortcut path for reading ints and indices. ''Note: first posting here was better, but a wrong measurement. Sorry.'' ====== 8.5a6 time: 6083734760, count: 100004702 ---------------------------------------------------- op %T avgT %ops Nops 6 20.32 247.27 5.00 5000321 INST_INVOKE_STK1 29 14.53 88.42 10.00 10000048 INST_INCR_SCALAR1_IMM 105 13.34 40.59 20.00 20000453 INST_START_CMD 10 11.54 28.08 25.00 25000750 INST_LOAD_SCALAR1 17 9.41 57.26 10.00 10000236 INST_STORE_SCALAR1 47 7.75 94.26 5.00 5000001 INST_LT 103 6.41 77.97 5.00 5000005 INST_LIST_INDEX_IMM 80 6.26 76.20 5.00 5000019 INST_LIST_INDEX 1 5.90 35.90 10.00 10001040 INST_PUSH1 3 4.51 54.84 5.00 5000302 INST_POP ====== By selective removing or duplicating code I estimate that, out of the 250 clicks of INST_INVOKE_STK1 (TEOvI), about 35 each are consumed by Tcl_GetCommandFromObj and Tcl_ResetResult. The trace code (surprisingly?) seems not to make any measurable difference. Note that commandReform (reducing the cost of Tcl_GetCommandFromObj) has at best a hope of shaving 20 clicks, less than 10% of the total. Tcl_ResetResult could conceivably be eliminated altogether, but this entails a change in the API: commands are not guaranteed to receive a clean objResult, errorCode, errorInfo or other return options. As they would have to reset the result themselves (when calling Tcl_SetObjResult) , most of the apparent win is not really there: we'd just be saving a Tcl_NewObj/Tcl_IncrRefCount/Tcl_DecrRefCount stanza. This one could conceivably be saved by allowing iPtr->objResultPtr to be NULL (and Tcl_GetObjResult fill it in if necessary), but that will break the deprecated but still used idiom of appending to the result Tcl_Obj. Small win, big pain - a no-go. ---- !!!!!! %| [Category Internals] | [Category Performance] |% !!!!!!