(MS: Review, additions, comments are welcome even though these are mainly notes to myself)
From bytecodes to wordcodes (or (4-byte)codes ?)
The idea is to have instructions coded in 4-byte (or machine words?) taking 4-byte operands; today, the instructions are coded in 1 byte, operands may be 1 or 4 bytes long.
Why
How
Testing prototype: To run some tests, I may write an on the fly translator that takes the BC as presently compiled and delivers the new format, which is then run. This requires only the translator function and the wordcode engine, not more. It would give a lower bound for speed improvements.
2001-20-06: the wordcode engine is functioning well. The bad news: no measurable speed improvement; the good news: memory impact is indeed negligible.
Inlined procs and TAL (Tcl assembly language, see Parsing, Bytecodes and Execution, The anatomy of a bytecoded command, and [L1 ])
These two are not the same, but are closely related.
Here the idea is to be able to write portable code that can be optimised to run much faster than plain Tcl (tcllib could be an interesting field of application).
These constructs would be used only in time-critical parts of a program. Inline code does not only save the function call overhead, but is an enabler for some TAL constructs that could otherwise not be used. As an example, consider (again ...) Donal's trick with the K combinator
set lst [lreplace [K $lst [set lst {}]] $idx $idx $newVal]
One would like to be able to define a proc Kset such that
set lst [lreplace [Kset lst] $idx $idx $newval]
produces the same effect of the above. But this cannot be done without recurring to uplevel/upvar (which are slow mechanisms)
proc Kset {lst} { upvar $lst lst0 set lst $lst0 set lst0 {} set lst }
A possible TAL syntax for Kset I have in mind would be (pseudocode!)
proc Kset {lst} { upvar $lst lst0 TAL::NOOP $lst0 TAL::POP [set lst0 {}] }
where TAL::NOOP does nothing (its arguments stay on the stack), and TAL::POP removes its arguments from the stack; so the original value of lst0 stays on the stacktop and is the return value. But upvar is still there ...
So, the real winner would be to be able to insert those TAL codes directly into the bytecodes of the calling proc, but without requiring the programmer to do it by hand (or even know that it is happening) - hence the 'need' for inlined procs.
Assorted remarks
A lower level engine with a hybrid stack
The bytecode engine could use some refactorisation to eliminate duplicated code - same goes for some of the non-engine internal procedures (I have tclVar.c in my mind, but there may be others). The changes also involve defining new lower level operations that would be useful/necessary for TAL programming.
In order to do this, the Tcl stack should be hybridised (polymorphised?). Currently it is a stack of (Tcl_Obj *) only (these are the values that Tcl variables can take); I am thinking in terms of also allowing other pointers on the Tcl stack, maybe even integers ...
To illustrate the idea: I'm thinking of being able to decompose the different INST_STORE instructions into
(a) get a pointer to the Var structure (different addressing possibilities here) (b) do whatever you have to do to the variable and its value (generic code) (c) possibly different cleanups depending on the addressing mode
This would allow to hold on to the pointer to a variable (TAL code, or compiled code) and avoid repeated lookups.
Some categories of primitive instructions I am thinking of (this does look like reprogramming Tcl in Forth, doesn't it?):
Remark that some higher level ops can be programmed with TAL/inline too!
The weirdest/newest one above is gosub; I am thinking in terms of executing some sequence of instructions and return to the calling point on termination. This requires something like Forth's return stack. I need such a construct also for the next project anyway ...
This duplicates a lot of the Tcl library; one possibility would of course be that some library functions become a front-end for the engine, another is to leave the duplication - which would suggest using some kind of code generator (in Tcl!) to keep everything nicely coordinated.
Note: vmgen [L2 ] may be an interesting tool for this task? Reading the paper ...
Non-recursive engine, tail jumps, going stackless, light threads, and similar insanities
The way Tcl is implemented is highly recursive at the C level; this poses migration problems when the target only supports small stacks (Palm, for instance).
There are actually (1 + 2n) stacks in Tcl, where n is the number of interpreters: the C stack, plus (for each interpreter) the engine's stack of Tcl_Obj and the interpreters callFrame stack. The idea is to get rid of the C stack wherever possible.
Presently, the engine generates recursive calls to itself whenever
(i) it invokes a proc (ii) it does INST_EVAL or INST_EXPR (iii) it touches a traced variable (iv) it invokes a C function that invokes the engine, via ''eval'' for instance (v) asynchronous calls are made
It is relatively easy to avoid these recursive calls in cases (ii) tested and (i), by incorporating some code from tclBasic.c and tclProc.c into the engine, creating local variables in an hybrid Tcl stack, and adding a return stack as described below. This technology is insufficient for (iii) and (iv), I haven't yet thought about (v).
The tested modification (coded using gcc's labels as values, straight C requires some more work but can be done) goes essentially as follows:
Remark that (half) tail-calls are almost trivial to implement now, where it not for the mess they create in the call stack. Tcl tail calls have the same problems as inline procs (upvar, uplevel, ...)
One more stack then? Well, this is not really necessary: Tcl bytecodes are very disciplined, they always leave the Tcl stack exactly where they found it! That means that we could actually stash these things in the Tcl stack, and know that they will be there at the top when we get back. A NULL at stackTop could then signal an internal return.
Actually, given this discipline, much more can be put in the same Tcl stack structure: the catch stack, the decrRefStack (S4 uses one), RS. All of these things nest quite nicely ... We are then back to (1+2n) stacks, with a C stack that is lightly loaded, and more stuff in the Tcl stack. (Part of this was tested in a weird manner: not by putting everything into the Tcl stack, but by putting everything and the Tcl stack into the C stack. Runs dandy ...)
(iii) cannot be solved in this manner - except if we call the traces from within the engine! So, this would require incorporating some lower level functionality into the engine, refactor some of the tclVar.c code and bring some components in here ...
These things are somewhat reminiscent of Stackless Python (see [L3 ] for initial pointers; thanks Cameron!). Essentially the same construct is possible in Tcl, I'll describe here my (current, fragile, half-baked) understanding of how it could go:
This requires of course a different return convention than today's: the return code of a proc or script has to be a variable in interp (as the result already is), and not a C return value In this manner all C recursions (but type iv, which require C returns ...) are gone.
Once we are here, it seems also quite easy to incorporate a main scheduler that distributes time slices to different interpreters - effectively allowing different interpreters to run concurrently in a cooperative manner (light threads).
(iv) can't really be helped. In order to get the full benefit of recursion-less, external programs would have to be rewritten to conform to a new interface (still not pondered ...)
RS (not Return Stack ;-): I'm amazed by these ideas. I have always missed the ability to define all Tcl commands in Tcl (so far, only for and eval in terms of uplevel) - TAL looks like it may provide the needed primitives. This can turn into a great thing! The TAL engine could grow to be the real core of Tcl, which would learn how to 'set', 'proc', 'foreach' by sourcing TAL code...
Re portable representation: how about reusing UTF-8 functionality? Even 4-byte-wide opcodes could start with small numbers, so in "UTF" file storage format 00..7F just take up 1 byte, 80..FFF take two. And: no byte-order problems... (see UTF-8 bit by bit, Unicode and UTF-8)
JCW This is great stuff, IMO, Miguel. You may want to dive into Lua [L4 ] - if you haven't already. They moved from a stack machine to a register machine (from 4.0 to 4.1a, IIRC), and reported 15% performance gains. They also always have used 32-bit words (well, on the major platforms, Lua can also go to much smaller architectures). And finally, Lua has always included an assembly-level access (symbolic, of course), which I really liked. It's "hidden" in one of the debug or test libraries - probably to prevent everyone going overboard with this.
(JCW again) As for (v), i.e. async calls: how about defining a new Tcl object which has no string representation. It represents a "future", i.e. a value which becomes real at some point in time. When accessing the value in a normal way, the evaluator blocks, waiting for completion. When accessed in some special way, one could peek inside and check whether it is a future, whether it is still in progress, cancel it, etc. One could even turn all of Tcl into a data flow system, or perhaps an ASM (Abstract State Machine), by always returning results from every call this way: "set a [foo $bar]" would start foo, set a to a future representing the result of foo, and then continue processing. The minute the value of a is used, the script would suspend at that particular spot, letting the rest progress, until the value of a is ready. The more practical implication of this, is that async I/O becomes far more transparent, because "set a [read $fd]" could be an async call, with processing continuing as usual.
CMcC 20040726: future defines a Tcl_ObjType which has some of the properties suggested by JCW. Just a fun project for an afternoon.
Lars H: I quite agree. To avoid using the C stack when it isn't necessary certainly seems like a good thing. As for (v) async calls: are these async as "completion calls from e.g. asyncronuous I/O operations" or more benign events (might happen in any order, but only at an update or similar)? It seems to me that if one avoids using the C stack then the latter kind of things would become more well-behaved, since vwaits wouldn't need to nest in the way that they do today. The futures of JCW is also a very exciting idea, possibly making the language more powerful (in some heavy CS sense). I'm not sure I would like to have all Tcl commands return futures by default, but one could always provide for each core command a futuristic counterpart that would return futures whenever possible, and maybe even use namespace import for choosing which set one wishes to have in the global namespace.
Roy Terry: Please forgive a practical plug here. For anyone reading this who is concerned about the overhead of calling procs with upvar, uplevel, global, etc. There is Tmac - a Tcl macro processor package which lets you define proc-like code blocks that are expanded "in-line". You can even generate the code block dynamically using a filter proc. It's always nice to have a fast base language but macros can offer some relief even now.
26jul04 jcw - Not sure what the status of bytecode->wordcode experiments is, but here's another option: both. Put all bytecodes in one vec, all words in another. Advance two independent pointers when running. It uses up one more iterator of which the state must be saved, but both iterators now operate on their native datatype (i.e. *bp++ and *wp++). Jumps need to set them both. That need not be more elaborate than: "wp += *bp++; bp += *bp;". Not sure how effective it is, in terms of on-chip cache effects for example, just wanted to get the idea onto this page.