Tcl Performance - taking it to the next level
The purpose of this page is to provide an anchor point to discuss some half-formed (half-baked?) ideas that various Tcl'ers (including Miguel Sofer, Donal Fellows and Kevin Kenny) discussed, largely in the hallway and the bar, at the 2007 Tcl conference in New Orleans.
The basic idea is that Miguel's work on bytecode engine performance, variable reform, and so on, can only go so far. Eventually some more radical reforms will be needed. These reforms can't be undertaken lightly. For one thing, they will break compatibility with tbcload when they are introduced (although it is possible that a code translator can restore compatibility). Nevertheless, if Tcl is to go much faster, bytecode reform looks like nearly the only path forward. We have nearly exhausted the improvements gained by bytecoding individual commands; indeed, a useful subset of Tcl procedures now converts to 100% bytecode, with no callouts to command procedures.
Some of the possible improvements are fairly easy and benign:
Larry Smith I'd like to see that cite. Since the entire virtual machine must operate from memory, I don't see how structuring a var as a stack offset is different from having it an int or long somewhere.
MJ - A lot of discussion and benchmarking went into this when the Lua guys switched to a register-based VM in 5.0; see for instance [L1 ]. A Google search for "register vs stack based vm" turns up some more interesting links (like [L2 ]).
TP One register-based VM, Parrot, is already being used by the Tcl implementation ParTcl. Perhaps the authors of ParTcl can chime in with recent developments or benchmarks.
KBK The chief advantage to the register-based approach appears to be the fact that local variables can go into registers. Doing so saves a tremendous amount of data motion. For an operation like [set a [+ b c]], a stack VM has to generate something like:
push b      ; essentially, *--sp=b     1 load, 1 store
push c      ; similarly, *--sp=c       1 load, 1 store
add         ; sp[1] += *sp; sp++;      2 loads, 1 store
store a     ; a=*sp++;                 1 load, 1 store
Total: 5 loads, 4 stores
This sequence transfers a large amount of data to and from the stack. By contrast, if local variables can be stored in registers, the sequence is just
add r[b], r[c] -> r[a] 2 loads, 1 store
This presumes three-address code. For two-address code, it's more like:
move r[b] -> r[a]    1 load, 1 store
add  r[c] -> r[a]    1 load, 1 store
Total: 2 loads, 2 stores
The latter has one redundant data motion, and takes two instructions instead of one; nevertheless, the fact that instructions can be more compact may make up for it.
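To make the data-motion comparison concrete, here is a rough C sketch (not Tcl's actual engine; all names invented) of the same a = b + c step on a stack machine and on a 3-address register machine:

```c
#include <assert.h>

/* Stack VM: operands travel through an explicit value stack.
 * Each line mirrors one instruction from the listing above. */
static long stack_add(long b, long c) {
    long stack[4];
    long *sp = stack + 4;
    *--sp = b;            /* push b:  1 load, 1 store */
    *--sp = c;            /* push c:  1 load, 1 store */
    sp[1] += *sp; sp++;   /* add:     2 loads, 1 store */
    return *sp++;         /* store a: 1 load, 1 store */
}

/* Register VM: one 3-address instruction, operands addressed directly. */
static void reg_add(long r[], int a, int b, int c) {
    r[a] = r[b] + r[c];   /* 2 loads, 1 store */
}
```

The stack version touches memory nine times where the register version touches it three times, matching the counts above.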
MS remarks that in our case the difference is actually larger, as these are Tcl_Obj values: each push to the stack also involves incrementing the Tcl_Obj's refcount, and each pop requires a decrement. So, counting that traffic too, the actual loads and stores in the CPU come to:
15 loads, 10 stores    stack-based
 6 loads,  3 stores    3-address register machine
 8 loads,  6 stores    2-address register machine
(I may have miscounted though)
Larry Smith Generating code for register machines can be a huge headache. Once you start with registers you will never have enough, and then the compiler must deal with how to spill registers back to memory. I've yet to see a compiler do that job gracefully and still generate code shorter than a stack machine.
Besides, your comparison makes no sense to me. "push b" is a load op, but it shouldn't count as a store because a stack machine simply loads it onto the stack (from wherever it came from, usually the instruction stream). This is especially relevant to a stack architecture that keeps the upper stack levels in a stack-register file on-chip. Yes, when you write it out in C code it does become 1 load, 1 store, but that is an implementation issue, not a fault of the stack machine.
KBK I was talking about the data motion in the underlying C implementation. We're not designing hardware here, we're designing a virtual machine. And the Tcl execution stack is not the hardware stack.
Even without storing variables in registers, register-oriented bytecoding may have something of an advantage:
load b -> r1      1 load, 1 store
add c -> r1       2 loads (counting accessing r1 again), 1 store
store r1 -> a     1 load, 1 store
Total: 4 loads, 3 stores
Larry Smith I like bytecodes, they are compact. To deal with in-line data, like pointers or integers, use a prefix or suffix data table. The byte code need not specify a data item in line, it need only use the next byte to indicate an offset into the data table. KBK Tcl's engine does that when it can, but things overflow the first 256 data table entries pretty quickly!
Larry Smith If the interpreter runs out of data table entries so quickly, why would anyone think it won't run out of registers even more quickly, thereby bogging down in spilling registers to memory?
I should also point out that the 256 limit was solved years ago - the Pascal p-code systems and the Smalltalk systems all implement escapes. One way would be to support 128 table entries directly; when the number is higher, the interpreter takes the remaining 7 bits and concatenates them with the next byte, so you have 15 bits, giving you 32K entries. This scheme only penalizes the interpreter for overly long code.
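The escape scheme Larry describes might be encoded like this (a minimal sketch, not anything Tcl actually does; function names invented):

```c
#include <stdint.h>

/* Short form: indices below 128 take one byte. Long form: set the
 * high bit, pack the remaining 7 high bits into the first byte and
 * the low 8 bits into a second byte, giving a 15-bit (32K) range. */
static int encode_index(uint8_t *out, unsigned idx) {
    if (idx < 128) {
        out[0] = (uint8_t)idx;
        return 1;                    /* short form: 1 byte */
    }
    out[0] = 0x80 | (idx >> 8);      /* escape bit + high 7 bits */
    out[1] = idx & 0xFF;             /* low 8 bits */
    return 2;                        /* long form: 2 bytes */
}

static unsigned decode_index(const uint8_t *in, int *len) {
    if (!(in[0] & 0x80)) { *len = 1; return in[0]; }
    *len = 2;
    return ((unsigned)(in[0] & 0x7F) << 8) | in[1];
}
```

Only bytecodes that actually address the far end of the table pay the extra byte, which is the point of the scheme.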
KBK Of course I was talking about a virtual machine that has essentially an arbitrary number of registers. I would anticipate that code with more than 256 live values at any given time would be fairly rare.
Larry Smith I believe most C compilers, including gcc, compile switch statements as a series of if/else tests, then try to tidy them up with an optimization pass later in code generation. (DKF: I've never ever seen that sort of code out of gcc, even in debug mode. That makes me suspect you're way wrong there.) [It's entirely possible. I haven't looked at gcc's internals since...since...well, a long time ago, anyway.] C has no nice, clean case statement like Pascal's, where the statement can be coded as a jump table. However, you can code a jump table using a constant array of pointers to the relevant functions. The inner loop of the interpreter then becomes "for(;;) { pc = (*jumptable[*pc])(pc); }", where "jumptable" is an array of functions implementing each bytecode, each returning the address of the next instruction to be executed. If you do a setjmp and have the exit opcode call longjmp, there are no conditions that the inner loop need detect; it makes for very fast code. Doing this also means that the jump table and inner loop code all fit into cache memory, meaning the only cache misses come from data. If your code is compact enough to fit entirely into cache as well, you will have all the software machinery of the interpreted program running from cache without misses. This could turn out to be faster than native code, which can often miss the cache and which will also require the VM system to page code in and out of memory.
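That dispatch loop can be sketched as follows (a toy interpreter with invented opcodes, not Tcl's engine; the setjmp/longjmp pair is exactly the exit mechanism described):

```c
#include <setjmp.h>

/* Each opcode is a function that returns the next pc; the exit opcode
 * longjmps out, so the inner loop needs no termination test at all. */
typedef const unsigned char *pc_t;
typedef pc_t (*op_fn)(pc_t);

static long acc;          /* the toy machine's single accumulator */
static jmp_buf done;

static pc_t op_inc(pc_t pc)  { acc++;    return pc + 1; }
static pc_t op_dbl(pc_t pc)  { acc *= 2; return pc + 1; }
static pc_t op_exit(pc_t pc) { (void)pc; longjmp(done, 1); }

static op_fn jumptable[] = { op_inc, op_dbl, op_exit };

static long run(const unsigned char *code) {
    pc_t pc = code;
    acc = 0;
    if (setjmp(done) == 0)
        for (;;) pc = (*jumptable[*pc])(pc);   /* no loop condition */
    return acc;
}
```

Running the program {inc, inc, dbl, exit} leaves 4 in the accumulator; the only branch ever taken is the indirect one through the table.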
KBK You are describing the particular sort of threaded code called 'JSR threading'. It's certainly a possibility that's on the table. It scares us a little bit for a couple of reasons. First, C compilers are very bad at optimising code in the presence of setjmp (it's a very hard problem).
Larry Smith True, except for one thing: the longjmp is used only to shut off the interpreter. There is only one "exit" or "quit" opcode, and that's the only thing that will need to break the interpreter's inner loop. The exit opcode only needs a way to unwind the stack and set the interpreter to a known state. If the "quit" opcode is allowed to just call exit() then we don't even need the setjmp. In any event, it need not be optimized.
KBK You know that. I know that. But the compiler is likely to turn off a number of optimizations in any activation record that contains a setjmp(), because it doesn't know that. [Then just call exit(0), and register any clean-up code with atexit()]
KBK Second, attempts to clean up the code in the bytecode interpreter by refactoring out function calls have foundered. Function calls out of the Tcl_ExecuteByteCode loop are disproportionately expensive on several platforms.
Larry Smith I don't see what you mean by "refactoring out function calls have foundered". It is true that some architectures do make calls very expensive. On such architectures we should take advantage of the statelessness of the stack architecture and the fact that the interpreter does not really need to use a call, since it has no intention of calling the opcode from other contexts, and could just as well use gotos to branch to the opcode routine and to get back. This avoids the overhead of a full call. We are still left with the problems of using the switch statement, but coding the inner loop in asm is not a huge sacrifice to get it as fast as possible. It is (or should be) a very small piece of code.
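The goto-based dispatch can actually be written without asm using the GCC/Clang labels-as-values extension (a toy sketch with invented opcodes; this "computed goto" style is not standard C, and it is the technique Tcl's engine can optionally use):

```c
/* Each opcode body ends by jumping straight to the next opcode's
 * label: no call overhead, no switch, no loop-top branch. */
static long run(const unsigned char *code) {
    static void *dispatch[] = { &&op_inc, &&op_dbl, &&op_halt };
    const unsigned char *pc = code;
    long acc = 0;

    goto *dispatch[*pc++];                   /* enter the threaded code */
op_inc:  acc++;    goto *dispatch[*pc++];
op_dbl:  acc *= 2; goto *dispatch[*pc++];
op_halt: return acc;
}
```

Compared with the function-table version, there is no call/return per opcode at all; the cost is tying the source to compilers that support the extension.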
KBK We're at something of a loss to explain why they take so much time, but measurements have produced quite reliable evidence that they do. We welcome explanations. (Supported with evidence, please! Idle speculation is unlikely to be helpful; the maintainers have trodden this ground rather thoroughly.)
Beyond that, we get into a realm where we need to begin to consider the safety of optimisations. The next big win looks to appear from the unboxing of local variables; essentially, in the (overwhelmingly common) case where a variable local to a procedure has no aliases and is not traced, we can eliminate the variable object altogether. Together with that, the overhead of checking for traces goes away, as does the overhead of repeated resolution of the variable name (the caching of the mapping from variable name to variable object is less effective than it might be).
For a concrete example, let's examine a procedure for computing the inner product of two vectors:
proc dot {a b} {
    set prod 0.0
    foreach x $a y $b {
        set prod [expr {$prod + $x*$y}]
    }
    return $prod
}
What we need to do, to promote local variables to registers, is to ascertain that
There are no doubt other rules. It is best to think in terms of proving that code is safe, rather than detecting that it is unsafe. It may in some cases be possible to defer safety checks to interpretation time, and fall back on deoptimized code if the checks fail.
Once variables are lifted to registers, it becomes possible to think about proving useful assertions about their contents. For instance, in the procedure above, it is fairly easy to prove that the variable prod can only hold a 'double' - and hence it should be possible to substitute specialized operations that work only on 'double' values, effectively unboxing the value twice (once from the variable and once from the Tcl_Obj).
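For illustration, if that inference succeeds, the fully unboxed loop of [dot] is roughly what a C compiler would see - no Tcl_Obj boxing, no variable lookup, no trace checks (a hypothetical sketch of what such a compiler could emit, not anything Tcl produces today):

```c
#include <stddef.h>

/* Double-specialized inner product: prod, x and y have been proved to
 * be doubles, so the values live in machine registers and memory,
 * never in Tcl_Obj wrappers. */
static double dot(const double *a, const double *b, size_t n) {
    double prod = 0.0;
    for (size_t i = 0; i < n; i++)
        prod += a[i] * b[i];
    return prod;
}
```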
Finally, if we can infer types on a significant fraction of the operations, it would become profitable to think of generating direct machine code. If this can be done well, it would be quite a tour de force; compiling very high level languages without resorting to extensive programmer-supplied assertions is seldom feasible. It does look to be (just barely) feasible for Tcl, thanks to Tcl's considerable regularities.
All of this, of course, emerges from idle speculation at a conference. (Conferences are a good time to dream big dreams, and a less good time to commit to execution.) But perhaps some of the participants can comment further in this space.
MS has been thinking about the register machine modification; it seems that it might be able to provide a relatively inexpensive way to provide the first-level unboxing for compiled local variables that requires no further analysis. The idea/trick is essentially (recorded here so that it does not go away, and to elicit commentary):
In this manner access to scalar local variables reduces to R/W access to the corresponding register. The question is what happens in the presence of local arrays, upvars and traces. The idea is:
Remark that on normal operation the bytecode engine just accesses the register. If it contains a tagged pointer it reverts to the current standard behaviour, otherwise it operates on the register's Tcl_Obj* directly.
Note that when upvaring to a foreign Var, if the foreign Var already was VAR_INDIRECT, the pre-existing indirection has to be removed and replaced with the new one.
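A loose C sketch of the tagged-register trick (all structure and macro names here are invented, not the real Var layout): a register word normally holds a Tcl_Obj* directly, and a low-bit tag marks the indirect case that reverts to the slow path:

```c
#include <stdint.h>

typedef struct Obj { long value; } Obj;   /* stand-in for Tcl_Obj */
typedef struct Var { Obj *objPtr; } Var;  /* stand-in for a full Var */

/* Pointers are at least 2-byte aligned, so the low bit is free to
 * serve as the "this register is really a Var*" tag. */
#define IS_INDIRECT(w)    ((w) & 1u)
#define MAKE_INDIRECT(v)  ((uintptr_t)(v) | 1u)
#define AS_VAR(w)         ((Var *)((w) & ~(uintptr_t)1))
#define AS_OBJ(w)         ((Obj *)(w))

static Obj *read_register(uintptr_t reg) {
    if (IS_INDIRECT(reg))
        return AS_VAR(reg)->objPtr;  /* slow path: upvar'd/traced var */
    return AS_OBJ(reg);              /* fast path: direct Tcl_Obj* */
}
```

Normal operation is one tag test plus a direct access; only registers that have been aliased or traced pay for the indirection.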
I know this is vague ...
KBK No, it's actually fairly lucid, or perhaps your explanation in New Orleans was clearer. That approach is among the 'low-hanging fruit' in that it doesn't break existing code (except for naughty code that reaches into Var structures). It's a very nice way to solve part of the problem without needing to do the hard work of data flow analysis. I apologize for forgetting it when I drafted this summary.
DKF: As an example of the sort of thing we were talking about, when a piece of code goes to pure bytecodes (i.e., no calls of non-bytecoded commands, especially including trace!) and only manipulates local variables (including no manipulation of linked variables created by the likes of global, upvar and variable), then it could be further compiled down to something much more efficient. Probably all the way to machine code. This can then be trivially extended further to the case where the code calls other commands that have had this treatment, and since we can do this determination all at runtime (we have the data, just not coded as such) we can extend this during runtime to cover much of the program. In the other cases, the current bytecode techniques could still be used; this is just an optimization strategy, so we don't need to solve every case.
KBK The "no calls to non-bytecoded commands" condition is far too strong. As long as we know what the non-bytecoded commands do, we're fine. We don't have to bytecode all of the commands that simply accept values and return results.
DKF: OK, there's a safety property to construct, together with a trivial way of lifting it to a procedure (a procedure is safe if it only calls safe commands and accesses safe variables). The interesting bit is whether we can extend that so that safe uses of (some) unsafe commands doesn't cause a problem with compilation either...
George Peter Staplin: I've said this for years, and I'll say it again -- I think the VM should be generated by a Tcl script. The Tcl script should be capable of embedding prologue and epilogue code for debugging and performance monitoring, as well as the ability to generate several different kinds of VM. This way we can experiment and we don't have to spend hours only to realize that by choosing a register machine we have a negative performance impact with some machines.
DKF: At the experimental stage? I agree. If we can manage to work out a good set of safety and correctness constraints? We can get more creative. (Remember, we don't need to handle everything. Falling back to bytecode is permitted. This means we can be conservative and punt when things get hard.)
KBK: No arguments about having the VM be generated code. The current one shows the limitations of the C preprocessor quite clearly; we need something stronger. But the choice of a register machine as opposed to a stack machine also cuts across all the compilation procedures, and the generation of those is a lot harder to automate.
JR: Before tossing out the stack VM, have well-known techniques for improving stack machine performance (such as peephole optimization of the bytecode or caching the TOS in a register) been looked at too?
And as long as the topic is radical reform, how about eliminating the C stack ala stackless python, or other techniques for enabling lightweight tasks?
DKF: We're already doing some peepholing, but could theoretically do more (the difficulty is that it is very hard to relocate and reorder Tcl bytecodes right now, for ugly reasons). We also cache the TOS (KBK and Next-On-Stack as well!) already. We're not willing to do the C stack elimination because it makes life much nastier for people writing extensions (and in any case, we've got real threads and can take proper advantage of multicore processors, unlike systems that do green-threads tricks) and we really want to continue to support things like Expect and SQLite which have non-trivial embeddings of the interpreter (or at least they're non-trivial when you're working with changing use of the C stack...)
AMG: (Wow, an edit conflict! Sorry, DKF.) I'd like to see this feature (stack elimination), since it would pave the way for coroutines. Briefly, I want a [yield] command which registers its continuation as an idle task (or similar) and returns to the event loop. This would make Keep a GUI alive during a long calculation a lot easier; basically it implements cooperative multitasking within a single thread.
However, eliminating the C stack is incompatible with C extensions which call back into Tcl. Calls into Tcl_Eval() and the like would have to somehow work in reverse, much like [yield]. The C code (unknowingly) returns into the Tcl engine, Tcl performs the action requested by the C code, Tcl resumes the C code where it left off, and the C code is none the wiser. I did manage to do something like this using getcontext(), setcontext(), makecontext(), and swapcontext(), so it's not impossible.
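A minimal sketch of that *context() trick (POSIX ucontext; deprecated on some platforms such as macOS, but illustrative): the coroutine "yields" back to its caller mid-function and is later resumed exactly where it left off:

```c
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static int trace[4], n;

static void coro(void) {
    trace[n++] = 2;
    swapcontext(&coro_ctx, &main_ctx);   /* "yield" back to main */
    trace[n++] = 4;                      /* resumed later, stack intact */
}

static void demo(void) {
    static char stack[64 * 1024];        /* the coroutine's own C stack */
    trace[n++] = 1;
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &main_ctx;        /* where to go when coro returns */
    makecontext(&coro_ctx, coro, 0);
    swapcontext(&main_ctx, &coro_ctx);   /* run coro until it yields */
    trace[n++] = 3;
    swapcontext(&main_ctx, &coro_ctx);   /* resume coro to completion */
}
```

The trace interleaves 1, 2, 3, 4 across the two contexts, which is precisely the control-flow shape a [yield] command would need.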
Currently the evaluation context lives on the C stack; I believe JR is proposing to move it into the Tcl_Interp.
AK: Where do these *context() functions come from? If we really do something in that area (not likely) then there is also the question of whether it would make sense to expose an async Tcl_Eval() in the public API. An async Tcl_Eval() would be a Tcl_Eval() not only taking the script to run, but also a C-level callback to run when the script is done, with the full result ... in essence, the continuation.
KBK The *context() functions are in the Single Unix Specification - http://www.opengroup.org/pubs/online/7908799/xsh/getcontext.html is a starting point.
AMG, to AK: An (exposed) asynchronous Tcl_Eval() doesn't currently exist, so of course no existing extensions use it. Also, in practice it would be difficult to use, since all functions that call into Tcl would have to be split up somehow, either into multiple functions or into multiple if or switch arms selected by a "stage"-type argument. By the way, this is the same difficulty we have with Keep a GUI alive during a long calculation. But all that aside, an asynchronous Tcl_Eval() would succeed in preventing the Tcl engine (right word choice?) from being called recursively. In my mind at least, this is all a little bit like the "update" problem.
JR: yes, moving the evaluation context from the C stack to the Tcl_Interp context is what I am thinking of. *context() is one way around it, hideous setjmp/longjmp hacks (is there any other way to use longjmp?) are another (see [L4 ] for vaguely related work on trampolines). Threads are one type of task, but they're not lightweight; creating thousands is usually ill-advised (yes I've been reading a bit too much about erlang recently). Coroutines are more like what I want, but my real motivation is vwait - I want to be able to vwait on a condition inside a callback and still have other callbacks work normally; in the current world doing that is asking for trouble. Re: peepholing - is it possible to do some kind of optimization for the lreplace shared Tcl_Obj problem? Or maybe a new engine could deal with it more naturally (I think the points about assertions on variables above touch on this). MS Notes that (a) this is slightly out of place here, and (b) the shared obj problem is non-solvable in general (COW), but maybe some of the tricks in the page on K are just what you need.
CGM: I want to support the idea that enabling coroutine-switching integrated with the event loop would be an enormous advance. I wrote a news post on this a while back: http://groups.google.co.uk/group/comp.lang.tcl/msg/e1553f7c2b23d7ff
KBK: "Coroutines integrated into the event loop" sounds like a euphemism for "Green Threads."
AMG: Maybe so, but I'm primarily interested in the ability to explicitly specify where context switches can happen, which is pretty much the opposite of what happens with true threads: programs specify where context switches can not happen (at least, not to certain threads). Anyway, see tcor - A Tcl Coroutines Extension. Also think about applications like iterators/generators and handlers for events and I/O, and tell me if your first impulse is to implement them using something called "threads". Probably the answer is no, even though we currently implement them using coroutines, albeit coroutines implemented as procs whose "closures" are explicitly loaded from and written to global state variables. [It's probably time to move this discussion to another page.]
MS notes that Tcl8.6 (from a2 onwards) has a trampoline-based C-stack-eliminating engine (see NRE). Tailcalls are available as ::tcl::unsupported::tailcall (may get supported before 8.6.1), coroutines are in the works.
SYStems Well, first, since DRH actually switched SQLite from a stack-based VM to a register-based VM, and as far as I know he is also part of Tcl's core team, I think he should be invited to share his input on this issue.
On the other hand, for the virtual machine part, Tcl has two ways to go: it can either improve its own VM or integrate with one of the existing ones; more precisely:
A language is, in my opinion, grammar plus vocabulary: the grammar is the language features (OO, coroutines, built-in concurrency, etc.) and the vocabulary is the libraries. I find the approach followed by the .NET framework optimal: one shared set of vocabulary in the form of a single library, and different sets of grammar in the form of languages. That strongly supports the idiom of using the best tool for the job, while allowing better capitalization on the vocabulary/libraries built up over time.
On the other hand, by moving to a different VM, I wonder: does Tcl lose the two-language framework of using C for the fast parts and Tcl for the rest? I also wonder, why don't we just move more code to C to make Tcl scripts faster?
DKF: The problem with using another VM is that Tcl has rather different variable semantics from most other languages, due to deep support for traces. That makes life much more difficult. I also think it should be a requirement that Tcl's (excellent) foreign function interface be retained in the mainline code; that's a major differentiator between Tcl and many other languages (doing this would also mean that we wouldn't need to port masses of extensions; having to rewrite the world would suck).
DKF: The current status of this is that we're exploring generating native code from Tcl bytecodes rather than fiddling around going to a different VM. This is a complex undertaking! We're using LLVM (via a lightly-modified llvmtcl) as the back-end optimiser and code generator — it does a wonderful job even with pretty crude transformations of code — and tclquadcode (which builds on tclbdd) is our type inference engine, allowing us to determine what the types in a piece of code are with only minimal extra type annotation. Current progress indicates that a speedup on moderately-mathematical code of 16 times (!!!) is clearly achievable, putting Tcl close to the speed of optimised C or Fortran code. Which is awesome.
There is a long way still to go.