Radical reform of the execution engine

Tcl Performance - taking it to the next level

The purpose of this page is to provide an anchor point to discuss some half-formed (half-baked?) ideas that various Tcl'ers (including Miguel Sofer, Donal Fellows and Kevin Kenny) discussed, largely in the hallway and the bar, at the 2007 Tcl conference in New Orleans.

The basic idea is that Miguel's work on bytecode engine performance, variable reform, and so on, can only go so far. Eventually some more radical reforms will be needed. These reforms can't be undertaken lightly. For one thing, they will break compatibility with tbcload when they are introduced (although it is possible that a code translator can restore compatibility). Nevertheless, if Tcl is to go much faster, bytecode reform looks like nearly the only path forward. We have nearly exhausted the improvements gained by bytecoding individual commands; indeed, a useful subset of Tcl procedures now converts to 100% bytecode, with no callouts to command procedures.

Some of the possible improvements are fairly easy and benign:

  • Replace the stack-oriented bytecodes with a register machine. Others report that register-oriented bytecode engines are significantly faster than stack-oriented ones.

Larry Smith I'd like to see that cite. Since the entire virtual machine must operate from memory, I don't see how structuring a var as a stack offset is different from having it be an int or long somewhere.

MJ - A lot of discussion and benchmarking went into this when the Lua developers switched to a register-based VM in 5.0; see for instance [L1 ]. A Google search for "register vs stack based vm" turns up some more interesting links.

TP One register-based VM, Parrot, is already being used by the Tcl implementation ParTcl. Perhaps the authors of ParTcl can chime in with recent developments or benchmarks.
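
To make the stack-versus-register difference concrete, here is a toy C comparison that evaluates the equivalent of set c [expr {$a + $b}] under both models. The instruction sets are invented for illustration and are not Tcl's actual bytecodes; the point is simply that the register form reaches the result in one dispatch where the stack form needs four.

    #include <stdio.h>

    /*
     * Toy illustration (invented opcodes, not Tcl's real instruction set)
     * of why a register VM dispatches fewer instructions than a stack VM
     * when evaluating the equivalent of:  set c [expr {$a + $b}]
     */

    enum { S_PUSH, S_ADD, S_STORE, S_HALT };   /* stack machine opcodes */
    enum { R_ADD, R_HALT };                    /* register machine opcodes */

    int main(void)
    {
        double vars[3] = { 1.5, 2.5, 0.0 };    /* local-variable slots a, b, c */

        /* Stack form: push a, push b, add, store c  -> 4 dispatches. */
        int scode[] = { S_PUSH, 0, S_PUSH, 1, S_ADD, S_STORE, 2, S_HALT };
        double stack[8];
        int sp = 0, pc = 0, dispatches = 0;
        for (;;) {
            int op = scode[pc++];
            if (op == S_HALT) break;
            dispatches++;
            switch (op) {
            case S_PUSH:  stack[sp++] = vars[scode[pc++]]; break;
            case S_ADD:   sp--; stack[sp-1] += stack[sp];  break;
            case S_STORE: vars[scode[pc++]] = stack[--sp]; break;
            }
        }
        printf("stack VM:    c = %g after %d dispatches\n", vars[2], dispatches);

        /* Register form: one three-address instruction -> 1 dispatch. */
        int rcode[] = { R_ADD, 2, 0, 1, R_HALT };
        vars[2] = 0.0; pc = 0; dispatches = 0;
        for (;;) {
            int op = rcode[pc++];
            if (op == R_HALT) break;
            dispatches++;
            switch (op) {
            case R_ADD: {
                int dst = rcode[pc++], s1 = rcode[pc++], s2 = rcode[pc++];
                vars[dst] = vars[s1] + vars[s2];
                break;
            }
            }
        }
        printf("register VM: c = %g after %d dispatches\n", vars[2], dispatches);
        return 0;
    }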

  • Replace bytecodes with wordcodes. Bytecode interpretation involves a fair amount of misaligned memory access; being able to store, say, 32-bit integers on a 32-bit boundary eliminates a fair amount of loading and shifting.

Larry Smith I like bytecodes; they are compact. To deal with in-line data, like pointers or integers, use a prefix or suffix data table. The byte code need not specify a data item in line; it need only use the next byte to indicate an offset into the data table.
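
For concreteness, here is one possible (purely hypothetical) shape for such an encoding in C, combining the two suggestions: fixed-width, naturally aligned instruction words decoded with shifts and masks, with anything too large for an operand field kept out-of-line in a side table indexed by an operand.

    #include <stdint.h>

    /*
     * Hypothetical wordcode layout, for illustration only; this is not
     * what Tcl emits today.  Each instruction is a single aligned 32-bit
     * word holding an 8-bit opcode and three 8-bit operand fields, so
     * decoding is a few shifts and masks on one aligned load rather than
     * several possibly misaligned byte fetches.  Large values (object
     * literals, jump targets, ...) live in a side table and are
     * referenced by index, as suggested above.
     */

    typedef uint32_t InstWord;

    #define OPCODE(w)     ((w) & 0xff)
    #define OPERAND_A(w)  (((w) >> 8)  & 0xff)
    #define OPERAND_B(w)  (((w) >> 16) & 0xff)
    #define OPERAND_C(w)  (((w) >> 24) & 0xff)

    #define ENCODE(op, a, b, c) \
        ((InstWord)(op) | ((InstWord)(a) << 8) | \
         ((InstWord)(b) << 16) | ((InstWord)(c) << 24))

    /*
     * Example: a three-address "add dst, src1, src2" over local-variable
     * slots, plus a "loadLiteral dst, idx" that pulls a constant from the
     * side table (INST_ADD and INST_LOAD_LITERAL are invented names):
     *
     *     InstWord code[] = {
     *         ENCODE(INST_LOAD_LITERAL, 0, 3, 0),   // reg0 := literals[3]
     *         ENCODE(INST_ADD,          2, 0, 1),   // reg2 := reg0 + reg1
     *     };
     */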

  • Replace the "big switch statement" with threaded[L2 ] code. This part needs to be undertaken fairly carefully; it's not clear that threading yields better performance in all cases.

Larry Smith I believe most C compilers, including gcc, compile switch statements as a series of if/else tests, then try to tidy them up with an optimization pass later in code generation. C has no nice, clean case statement like Pascal's, where the statement can be coded as a jump table. However, you can code a jump table yourself using a constant array of pointers to the relevant functions. The inner loop of the interpreter then becomes "for(;;) { pc = jumptable[*pc](pc); }", where "jumptable" is an array of functions implementing each bytecode, each of which returns the address of the next instruction to be executed. If you do a setjmp before the loop and have the exit opcode call longjmp, there are no conditions that the inner loop need detect, which makes for very fast code. Doing this also means that the jumptable and inner loop code all fit into cache memory, so the only cache misses come from data. If your bytecode is compact enough to fit into cache as well, you will have all the software machinery of the interpreted program running from cache without misses. This could turn out to be faster than native code, which can often miss the cache and which will also require the VM system to page code in and out of memory.
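
Here is a minimal, self-contained sketch of the dispatch style Larry describes, with invented opcodes: each instruction handler returns the address of the next instruction, and the inner loop just calls through a table of function pointers. (A running flag is used instead of the setjmp/longjmp trick, to keep the example short.)

    #include <stdio.h>

    /*
     * Function-pointer dispatch sketch.  The opcodes are invented for
     * illustration; this is not Tcl's engine.
     */

    typedef const unsigned char *(*OpFn)(const unsigned char *pc);

    static double acc;                 /* a one-register toy "machine" */
    static int running = 1;

    static const unsigned char *op_load(const unsigned char *pc)
    {
        acc = pc[1];                   /* operand byte is an immediate value */
        return pc + 2;
    }

    static const unsigned char *op_incr(const unsigned char *pc)
    {
        acc += 1;
        return pc + 1;
    }

    static const unsigned char *op_halt(const unsigned char *pc)
    {
        running = 0;
        return pc;
    }

    static const OpFn jumptable[] = { op_load, op_incr, op_halt };

    int main(void)
    {
        /* load 41; incr; halt */
        const unsigned char code[] = { 0, 41, 1, 2 };
        const unsigned char *pc = code;

        while (running)                /* the whole interpreter inner loop */
            pc = jumptable[*pc](pc);

        printf("acc = %g\n", acc);     /* prints 42 */
        return 0;
    }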

  • There are several other reforms that can be tried in this general area. Miguel has some ideas for "command reform" similar to the recently-committed "variable reform", for instance.

Beyond that, we get into a realm where we need to begin to consider the safety of optimisations. The next big win looks to come from the unboxing of local variables; essentially, in the (overwhelmingly common) case where a variable local to a procedure has no aliases and is not traced, we can eliminate the variable object altogether. Together with that, the overhead of checking for traces goes away, as does the overhead of repeated resolution of the variable name (the caching of the mapping from variable name to variable object is less effective than it might be).

For a concrete example, let's examine a procedure for computing the inner product of two vectors:

    proc dot {a b} {
        set prod 0.0
        foreach x $a y $b {
            set prod [expr {$prod + $x*$y}]
        }
        return $prod
    }

What we need to do, to promote local variables to registers, is to ascertain that

  • Particular blocks of code fire no execution or variable traces. At the very least, traces that do fire must not violate any of these rules with respect to the variables in question.
  • Particular blocks of code establish no new execution or variable traces.
  • Particular blocks of code do not change the semantics of Core commands within the blocks (here, set, expr, foreach and return), either by redefinition, overloading in a namespace, or changing command resolution rules.
  • Particular blocks of code do not change the variables to which names resolve: either by aliasing with upvar, variable or global, or by changing variable name resolvers.
  • There is no opportunity to change a variable in an uncontrolled fashion, for example by upvar or uplevel from a called procedure or eval of unknown code in the current procedure.

There are no doubt other rules. It is best to think in terms of proving that code is safe, rather than detecting that it is unsafe. It may in some cases be possible to defer safety checks to interpretation time, and fall back on deoptimized code if the checks fail.

Once variables are lifted to registers, it becomes possible to think about proving useful assertions about their contents. For instance, in the procedure above, it is fairly easy to prove that the variable prod can only be a 'double' - and hence it should be possible to substitute specialized operations that work only on 'double' values, effectively unboxing the value twice (once from the variable and once from the Tcl_Obj).
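
As a rough illustration of the payoff (hand-written here, not the output of any existing compiler), a fully specialized, twice-unboxed version of the loop in dot would reduce to something like the following C, with prod and the list elements existing only as machine doubles:

    #include <stddef.h>

    /*
     * Hand-written sketch of the target for a type-specialized, unboxed
     * "dot": once prod and the list elements are proven to be doubles,
     * none of them need to exist as Tcl_Obj values or Var structures
     * inside the loop.  Converting the two Tcl lists to C arrays of
     * doubles is assumed to happen once, outside this function.
     */
    static double
    DotSpecialized(const double *a, const double *b, size_t n)
    {
        double prod = 0.0;
        for (size_t i = 0; i < n; i++) {
            prod += a[i] * b[i];
        }
        return prod;
    }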

Finally, if we can infer types on a significant fraction of the operations, it would become profitable to think of generating direct machine code. If this can be done well, it would be quite a tour de force; compiling very high level languages without resorting to extensive programmer-supplied assertions is seldom feasible. It does look to be (just barely) feasible for Tcl, thanks to Tcl's considerable regularities.

All of this, of course, emerges from idle speculation at a conference. (Conferences are a good time to dream big dreams, and a less good time to commit to execution.) But perhaps some of the participants can comment further in this space.


MS has been thinking about the register machine modification; it seems that it might provide a relatively inexpensive way to get the first-level unboxing for compiled local variables that requires no further analysis. The idea/trick is essentially (recorded here so that it does not go away, and to elicit commentary):

  • have a new Var flag bit VAR_INDIRECT, indicating that the value field has a (Tcl_Obj **) instead of a (Tcl_Obj *). Note that this requires an extension of the value union
  • teach all the var-related code how to deal with VAR_INDIRECT
  • assign one register slot to each local variable; store its value there, and let the Var be VAR_INDIRECT with a pointer to the register slot in its value

In this manner access to scalar local variables reduces to R/W access to the corresponding register. The question is what happens in the presence of local arrays, upvars and traces. The idea is:

  • a second variable upvars to this one: no change, we still R/W the register
  • local variable is an upvar to an untraced foreign scalar variable: the local Var is written as a link to the foreign Var, the foreign Var is made VAR_INDIRECT and its value is stored in the register. Things have to be restored properly on proc return! Note that this also works (details?) if the foreign Var is an array element. We still R/W the register.
  • local arrays, traces, upvars to arrays or traced variables: the register slot stores 0x1 (or some other impossible value, tagged pointers), the Var is not VAR_INDIRECT. The bytecode falls back to regular local Var handling (as today)

Note that in normal operation the bytecode engine just accesses the register. If the register contains a tagged pointer, the engine reverts to the current standard behaviour; otherwise it operates on the register's Tcl_Obj * directly.
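
A rough C sketch of the shape of this idea follows; the struct, flag and function names are invented for illustration and do not match the real tclInt.h definitions.

    #include <tcl.h>

    /*
     * Illustrative sketch of the VAR_INDIRECT idea; the names below are
     * invented and are not the actual tclInt.h definitions.
     */

    #define VAR_INDIRECT   0x4000                /* hypothetical new flag bit */
    #define REG_UNUSABLE   ((Tcl_Obj *) 0x1)     /* tagged "fall back" marker */

    typedef struct SketchVar {
        int flags;
        union {
            Tcl_Obj *objPtr;        /* normal case: value stored in the Var  */
            Tcl_Obj **indirectPtr;  /* VAR_INDIRECT: value lives in a
                                     * register slot owned by the frame      */
            /* ... other members (tablePtr, linkPtr, ...) elided ...         */
        } value;
    } SketchVar;

    /* Ordinary Var machinery, used when the register slot is unusable. */
    extern Tcl_Obj *SketchSlowPathRead(SketchVar *varPtr);   /* hypothetical */

    /*
     * What the bytecode engine would do to read scalar local number i:
     * touch only the register array unless the slot carries the tag.
     */
    static Tcl_Obj *
    ReadLocal(Tcl_Obj **regs, SketchVar *compiledLocals, int i)
    {
        Tcl_Obj *regVal = regs[i];

        if (regVal != REG_UNUSABLE) {
            return regVal;          /* fast path: direct register access */
        }
        /* Slow path: local arrays, traces, links to traced or array
         * variables, etc. go through the regular Var code, as today. */
        return SketchSlowPathRead(&compiledLocals[i]);
    }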

Note that when upvaring to a foreign Var, if the foreign Var already was VAR_INDIRECT, the pre-existing indirection has to be removed and replaced with the new one.

I know this is vague ...


DKF: As an example of the sort of thing we were talking about, when a piece of code goes to pure bytecodes (i.e., no calls to non-bytecoded commands, especially including trace!) and only manipulates local variables (including no manipulation of linked variables created by the likes of global, upvar and variable), then it could be further compiled down to something much more efficient, probably all the way to machine code. This can then be trivially extended to the case where the code calls other commands that have had the same treatment, and since we can make this determination at runtime (we have the data, just not coded as such), the optimization can spread at runtime to cover much of the program. In the other cases, the current bytecode techniques could still be used; this is just an optimization strategy, so we don't need to solve every case.


George Peter Staplin: I've said this for years, and I'll say it again -- I think the VM should be generated by a Tcl script. The Tcl script should be capable of embedding prologue and epilogue code for debugging and performance monitoring, as well as of generating several different kinds of VM. This way we can experiment, and we don't have to spend hours only to realize that choosing a register machine hurts performance on some machines.

DKF: At the experimental stage? I agree. If we can manage to work out a good set of safety and correctness constraints? We can get more creative. (Remember, we don't need to handle everything. Falling back to bytecode is permitted. This means we can be conservative and punt when things get hard.)


Category Internals