MS's bytecode engine ideas

(MS: Review, additions, comments are welcome even though these are mainly notes to myself)

From bytecodes to wordcodes (or (4-byte)codes ?)

The idea is to have instructions coded in 4-byte (or machine words?) taking 4-byte operands; today, the instructions are coded in 1 byte, operands may be 1 or 4 bytes long.

Why

Bytecodes(BC) are for internal consumption: they are typically run by the same Tcl installation that compiled them. It makes sense to use the best internal representation for that installation. If import/export functionality (tclPro) is desired for the BC, it is that interface that should translate to/from a portable representation (and re-introduce space savings if the internal representation is bloated)
The actual bytecodes have probably a very small memory footprint compared to everything else that is going on in Tcl (I'm guessing!), no measurable negative impact to be expected in this sense
The BC engine is (slightly) more complex: there are many duplicate instructions, to deal with cases where an operand is represented in 1 or 4 bytes
Wordcodes allow different (and faster) instruction scheduling mechanisms (direct jumps from instruction end to the start of the next instruction): with runtime table lookups (msvc), or even without (gcc)
BC are run-time expensive: there is a lot of byte-to-word 0-padding going on at runtime
Speed tradeoff? More memory traffic, alignment effects (?)

How

First the easy one: make all operands wordsized. Slight retouches to the compiler, it is mainly data and parameters that need changes.
Abstract away the internal representation: need to replace 'unsigned char *pc' declarations with a macro, the Bytecode structure has to be abstracted too. Changes in many many places ...

Testing prototype: To run some tests, I may write an on the fly translator that takes the BC as presently compiled and delivers the new format, which is then run. This requires only the translator function and the wordcode engine, not more. It would give a lower bound for speed improvements.

2001-20-06: the wordcode engine is functioning well. The bad news: no measurable speed improvement; the good news: memory impact is indeed negligible.

Inlined procs and TAL (Tcl assembly language, see Parsing, Bytecodes and Execution, The anatomy of a bytecoded command, and [L1 ])

These two are not the same, but are closely related.

Here the idea is to be able to write portable code that can be optimised to run much faster than plain Tcl (tcllib could be an interesting field of application).

These constructs would be used only in time-critical parts of a program. Inline code does not only save the function call overhead, but is an enabler for some TAL constructs that could otherwise not be used. As an example, consider (again ...) Donal's trick with the K combinator

    set lst [lreplace [K $lst [set lst {}]] $idx $idx $newVal]

One would like to be able to define a proc Kset such that

    set lst [lreplace [Kset lst] $idx $idx $newval]

produces the same effect of the above. But this cannot be done without recurring to uplevel/upvar (which are slow mechanisms)

    proc Kset {lst} {
        upvar $lst lst0
        set lst $lst0
        set lst0 {}
        set lst
    }

A possible TAL syntax for Kset I have in mind would be (pseudocode!)

    proc Kset {lst} {
        upvar $lst lst0
        TAL::NOOP $lst0
        TAL::POP [set lst0 {}]
    }

where TAL::NOOP does nothing (its arguments stay on the stack), and TAL::POP removes its arguments from the stack; so the original value of lst0 stays on the stacktop and is the return value. But upvar is still there ...

So, the real winner would be to be able to insert those TAL codes directly into the bytecodes of the calling proc, but without requiring the programmer to do it by hand (or even know that it is happening) - hence the 'need' for inlined procs.

Assorted remarks

This looks like it can be implemented as an extension: it just requires the definition of new commands (TAL::inlineProc, TAL::NOOP, ...) that have a compile procedure - the tough one is the compile procedure for TAL::inlineProc, the others are probably simple
TAL does not need to, and probably can't really, expose every opcode of the internal engine; actually, the TAL pseudo-engine doesn't even need to be the same as the internal engine
In order for TAL to be really useful and fast, it may need some redesign of the engine (see other project below)
The tough part of this: deciding what can be inlined, and what can't. Inlining makes a mess of scoping rules and stack traces ...

A lower level engine with a hybrid stack

The bytecode engine could use some refactorisation to eliminate duplicated code - same goes for some of the non-engine internal procedures (I have tclVar.c in my mind, but there may be others). The changes also involve defining new lower level operations that would be useful/necessary for TAL programming.

In order to do this, the Tcl stack should be hybridised (polymorphised?). Currently it is a stack of (Tcl_Obj *) only (these are the values that Tcl variables can take); I am thinking in terms of also allowing other pointers on the Tcl stack, maybe even integers ...

To illustrate the idea: I'm thinking of being able to decompose the different INST_STORE instructions into

   (a) get a pointer to the Var structure (different addressing possibilities here)
   (b) do whatever you have to do to the variable and its value (generic code)
   (c) possibly different cleanups depending on the addressing mode

This would allow to hold on to the pointer to a variable (TAL code, or compiled code) and avoid repeated lookups.

Some categories of primitive instructions I am thinking of (this does look like reprogramming Tcl in Forth, doesn't it?):

Stack management - push, pop, dup, drop, swap, rot, peek, poke
Reference counts - incrRefCt, decrRefCt, getRefCt, isShared, dupObject
Scope management - (deal with the Tcl call stack, namespaces, ...)
Tcl_Obj ops - get & store (with append/lappend/replace flags), math-ops, ...
Var/Cmd ops - link, unlink, getVarPtr, getLinkVarPtr, getCmdPtr, getLinkCmdPtr, callTraces
Flow control - jumps (as today), foreach, return, error, catch, endCatch, invoke, eval, gosub
Integers (??) - integer basic arithmetic, integer-based flow control
Higher level ops - essentially, assemblies of the above (inlined as macros)

Remark that some higher level ops can be programmed with TAL/inline too!

The weirdest/newest one above is gosub; I am thinking in terms of executing some sequence of instructions and return to the calling point on termination. This requires something like Forth's return stack. I need such a construct also for the next project anyway ...

This duplicates a lot of the Tcl library; one possibility would of course be that some library functions become a front-end for the engine, another is to leave the duplication - which would suggest using some kind of code generator (in Tcl!) to keep everything nicely coordinated.

Note: vmgen [L2 ] may be an interesting tool for this task? Reading the paper ...

Non-recursive engine, tail jumps, going stackless, light threads, and similar insanities

The way Tcl is implemented is highly recursive at the C level; this poses migration problems when the target only supports small stacks (Palm, for instance).

There are actually (1 + 2n) stacks in Tcl, where n is the number of interpreters: the C stack, plus (for each interpreter) the engine's stack of Tcl_Obj and the interpreters callFrame stack. The idea is to get rid of the C stack wherever possible.

Presently, the engine generates recursive calls to itself whenever

   (i)   it invokes a proc 
   (ii)  it does INST_EVAL or INST_EXPR
   (iii) it touches a traced variable
   (iv)  it invokes a C function that invokes the engine, via ''eval'' for instance
   (v)   asynchronous calls are made

It is relatively easy to avoid these recursive calls in cases (ii) tested and (i), by incorporating some code from tclBasic.c and tclProc.c into the engine, creating local variables in an hybrid Tcl stack, and adding a return stack as described below. This technology is insufficient for (iii) and (iv), I haven't yet thought about (v).

The tested modification (coded using gcc's labels as values, straight C requires some more work but can be done) goes essentially as follows:

create a return stack RS in the engine
detect the points where tou can switch to executing a different bytecode: INST_EVAL, INST_EXPR, INST_INVOKE (if it is a proc's invocation); incorporate there the code that needs to be run before the actual bytecodes are executed; define a returnTarget with the code that needs to be executed on return
before switching to another bytecode, stash in the return stack the variables you will need on return: codePtr, pc, interp, returnTarget, ...
modify the exit code of the engine to do an internal return if the RS is not empty, and a real C return otherwise. On internal returns, you restore the saved variables and jump to the returnTarget.

Remark that (half) tail-calls are almost trivial to implement now, where it not for the mess they create in the call stack. Tcl tail calls have the same problems as inline procs (upvar, uplevel, ...)

One more stack then? Well, this is not really necessary: Tcl bytecodes are very disciplined, they always leave the Tcl stack exactly where they found it! That means that we could actually stash these things in the Tcl stack, and know that they will be there at the top when we get back. A NULL at stackTop could then signal an internal return.

Actually, given this discipline, much more can be put in the same Tcl stack structure: the catch stack, the decrRefStack (S4 uses one), RS. All of these things nest quite nicely ... We are then back to (1+2n) stacks, with a C stack that is lightly loaded, and more stuff in the Tcl stack. (Part of this was tested in a weird manner: not by putting everything into the Tcl stack, but by putting everything and the Tcl stack into the C stack. Runs dandy ...)

(iii) cannot be solved in this manner - except if we call the traces from within the engine! So, this would require incorporating some lower level functionality into the engine, refactor some of the tclVar.c code and bring some components in here ...

These things are somewhat reminiscent of Stackless Python (see [L3 ] for initial pointers; thanks Cameron!). Essentially the same construct is possible in Tcl, I'll describe here my (current, fragile, half-baked) understanding of how it could go:

some more info will need to go into the callFrame stack (maybe RS should also go in here, and not in the Tcl Stack?)
a call to eval a script/bytecode is now just a scheduler: (a) pushes a callFrame, (b) pushes data into RS, (c) returns to the main executor
the main executor just executes whatever it finds at the stacktop - asynchronous calls should get there too ...

This requires of course a different return convention than today's: the return code of a proc or script has to be a variable in interp (as the result already is), and not a C return value In this manner all C recursions (but type iv, which require C returns ...) are gone.

Once we are here, it seems also quite easy to incorporate a main scheduler that distributes time slices to different interpreters - effectively allowing different interpreters to run concurrently in a cooperative manner (light threads).

(iv) can't really be helped. In order to get the full benefit of recursion-less, external programs would have to be rewritten to conform to a new interface (still not pondered ...)

RS (not Return Stack ;-): I'm amazed by these ideas. I have always missed the ability to define all Tcl commands in Tcl (so far, only for and eval in terms of uplevel) - TAL looks like it may provide the needed primitives. This can turn into a great thing! The TAL engine could grow to be the real core of Tcl, which would learn how to 'set', 'proc', 'foreach' by sourcing TAL code...

Re portable representation: how about reusing UTF-8 functionality? Even 4-byte-wide opcodes could start with small numbers, so in "UTF" file storage format 00..7F just take up 1 byte, 80..FFF take two. And: no byte-order problems... (see UTF-8 bit by bit, Unicode and UTF-8)

JCW This is great stuff, IMO, Miguel. You may want to dive into Lua [L4 ] - if you haven't already. They moved from a stack machine to a register machine (from 4.0 to 4.1a, IIRC), and reported 15% performance gains. They also always have used 32-bit words (well, on the major platforms, Lua can also go to much smaller architectures). And finally, Lua has always included an assembly-level access (symbolic, of course), which I really liked. It's "hidden" in one of the debug or test libraries - probably to prevent everyone going overboard with this.

(JCW again) As for (v), i.e. async calls: how about defining a new Tcl object which has no string representation. It represents a "future", i.e. a value which becomes real at some point in time. When accessing the value in a normal way, the evaluator blocks, waiting for completion. When accessed in some special way, one could peek inside and check whether it is a future, whether it is still in progress, cancel it, etc. One could even turn all of Tcl into a data flow system, or perhaps an ASM (Abstract State Machine), by always returning results from every call this way: "set a [foo $bar]" would start foo, set a to a future representing the result of foo, and then continue processing. The minute the value of a is used, the script would suspend at that particular spot, letting the rest progress, until the value of a is ready. The more practical implication of this, is that async I/O becomes far more transparent, because "set a [read $fd]" could be an async call, with processing continuing as usual.

CMcC 20040726: future defines a Tcl_ObjType which has some of the properties suggested by JCW. Just a fun project for an afternoon.

Lars H: I quite agree. To avoid using the C stack when it isn't necessary certainly seems like a good thing. As for (v) async calls: are these async as "completion calls from e.g. asyncronuous I/O operations" or more benign events (might happen in any order, but only at an update or similar)? It seems to me that if one avoids using the C stack then the latter kind of things would become more well-behaved, since vwaits wouldn't need to nest in the way that they do today. The futures of JCW is also a very exciting idea, possibly making the language more powerful (in some heavy CS sense). I'm not sure I would like to have all Tcl commands return futures by default, but one could always provide for each core command a futuristic counterpart that would return futures whenever possible, and maybe even use namespace import for choosing which set one wishes to have in the global namespace.

Roy Terry: Please forgive a practical plug here. For anyone reading this who is concerned about the overhead of calling procs with upvar, uplevel, global, etc. There is Tmac - a Tcl macro processor package which lets you define proc-like code blocks that are expanded "in-line". You can even generate the code block dynamically using a filter proc. It's always nice to have a fast base language but macros can offer some relief even now.

26jul04 jcw - Not sure what the status of bytecode->wordcode experiments is, but here's another option: both. Put all bytecodes in one vec, all words in another. Advance two independent pointers when running. It uses up one more iterator of which the state must be saved, but both iterators now operate on their native datatype (i.e. *bp++ and *wp++). Jumps need to set them both. That need not be more elaborate than: "wp += *bp++; bp += *bp;". Not sure how effective it is, in terms of on-chip cache effects for example, just wanted to get the idea onto this page.

Category Internals