Version 4 of Radical reform of the execution engine

Tcl Performance - taking it to the next level

The purpose of this page is to provide an anchor point to discuss some half-formed (half-baked?) ideas that various Tcl'ers (including Miguel Sofer, Donal Fellows and Kevin Kenny) discussed, largely in the hallway and the bar, at the 2007 Tcl conference in New Orleans.

The basic idea is that Miguel's work on bytecode engine performance, variable reform, and so on, can only go so far. Eventually some more radical reforms will be needed. These reforms can't be undertaken lightly. For one thing, they will break compatibility with tbcload when they are introduced (although it is possible that a code translator can restore compatibility). Nevertheless, if Tcl is to go much faster, bytecode reform looks like nearly the only path forward. We have nearly exhausted the improvements gained by bytecoding individual commands; indeed, a useful subset of Tcl procedures now converts to 100% bytecode, with no callouts to command procedures.

Some of the possible improvements are fairly easy and benign:

Replace the stack-oriented bytecodes with a register machine. Others report that register-oriented bytecode engines are significantly faster than stack-oriented ones.
Replace bytecodes with wordcodes. Bytecode interpretation involves a fair amount of misaligned memory access; being able to store, say, 32-bit integers on a 32-byte boundary eliminates a fair amount of loading and shifting.
Replace the "big switch statement" with threaded code. This part needs to be undertaken fairly carefully; it's not clear that threading yields better performance in all cases.
There are several other reforms that can be tried in this general area. Miguel has some ideas for "command reform" similar to the recently-committed "variable reform", for instance.

Beyond that, we get into a realm where we need to begin to consider the safety of optimisations. The next big win looks to appear from the unboxing of local variables; essentially, in the (overwhelmingly common) case where a variable local to a procedure has no aliases and is not traced, we can eliminate the variable object altogether. Together with that, the overhead of checking for traces goes away, as does the overhead of repeated resolution of the variable name (the caching of the mapping from variable name to variable object is less effective than it might be).

For a concrete example, let's examine a procedure for computing the inner product of two vectors:

    proc dot {a b} {
        set prod 0.0
        foreach x $a y $b {
            set prod [expr {$prod + $x*$y}]
        }
        return $prod
    }

What we need to do, to promote local variables to registers, is to ascertain that

Particular blocks of code fire no execution or variable traces. At the very least, traces that do fire must not violate any of these rules with respect to the variables in question.
Particular blocks of code establish no new execution or variable traces.
Particular blocks of code do not change the semantics of Core commands within the blocks (here, [set], [expr], [foreach] and [return]), either by redefinition, overloading in a namespace, or changing command resolution rules.
Particular blocks of code do not change the variables to which names resolve: either by aliasing with [upvar], [variable] or [global], or by changing variable name resolvers.
There is no opportunity to change a variable in an uncontrolled fashion, for example by [upvar] or [uplevel] from a called procedure or [eval] of unknown code in the current procedure.

There are no doubt other rules. It is best to think in terms of proving that code is safe, rather than detecting that it is unsafe. It may in some cases be possible to defer safety checks to interpretation time, and fall back on deoptimized code if the checks fail.

Once variable are lifted to registers, it becomes possible to think about proving useful assertions about the variable contents. For instance, in the procedure above, it is fairly easy to prove that the variable of prod can only be a 'double' - and hence it should be possible to substitute specialized operations that work only on 'double' values, effectively unboxing the value twice (once from the variable and once from the Tcl_Obj).

Finally, if we can infer types on a significant fraction of the operations, it would become profitable to think of generating direct machine code. If this can be done well, it would be quite a tour de force; compiling very high level languages without resorting to extensive programmer-supplied assertions is seldom feasible. It does look to be (just barely) feasible for Tcl, thanks to Tcl's considerable regularities.

All of this, of course, emerges from idle speculation at a conference. (Conferences are a good time to dream big dreams, and a less good time to commit to execution.) But perhaps some of the participants can comment further in this space.

MS has been thinking about the register machine modification; it seems that it might be able to provide a relatively inexpensive way to provide the first-level unboxing for compiled local variables that requires no further analysis. The idea/trick is essentially (recorded here so that it does not go away, and to ellicit commentary):

have a new Var flag bit VAR_INDIRECT, indicating that the value field has a (Tcl_Obj **) instead of a (Tcl_Obj *). Note that this requires an extension of the value union
teach all the var-related code how to deal with VAR_INDIRECT
assign one register slot to each local variable; store its value there, and let the Var be VAR_INDIRECT with a pointer to the register slot in its value

In this manner access to scalar local variables reduces to R/W access to the corresponding register. The question is what happens in the presence of local arrays, upvars and traces. The idea is:

a second variable upvars to this one: no change, we still R/W the register
local variable is an upvar to an untraced foreign scalar variable: the local Var is written as a link to the foreign Var, the foreign Var is made VAR_INDIRECT and its value is stored in the register. Things have to be restored properly on proc return! Note that this also works (details?) if the foreign Var is an array element
local arrays, traces, upvars to arrays or traced variables: the register slot stores 0x1 (or some other impossible value, tagged pointers), the Var is not VAR_INDIRECT. The bytecode falls back to regular local Var handling (as today)

Note that when upvaring to a foreign Var, if the foreign Var already was VAR_INDIRECT, the pre-existing indirection has to be removed and replaced with the new one.

I know this is vague ...

DKF: As an example of the sort of thing we were talking about, when a piece of code goes to pure bytecodes (i.e., no calls of non-bytecoded commands, especially including trace!) and only manipulates local variables (including no manipulation of linked variables created by the likes of global, upvar and variable), then it could be further compiled down to something much more efficient. Probably all the way to machine code. This can then be trivially extended further to the case where the code calls other commands that have had this treatment, and since we can do this determination all at runtime (we have the data, just not coded as such) we can extend this during runtime to cover much of the program. In the other cases, the current bytecode techniques could still be used; this is just an optimization strategy, so we don't need to solve every case.

Category Internals