suspend and resume

Lars H, 2008-08-27: The following was first posted to the tcl-core mailing list, as a kind of well the NRE coroutines may be nice but it might actually be possible to do something even more powerful…

The main reason I'm putting it here now is that I actually managed to produce a very elementary proof of concept: suspend and resume in pure Tcl.

The specification

(Passages in italic have been added, to clarify things that were obscure in or omitted from the original posting.)

Basic premise

Primarily I wanted a non-nesting vwait — one which would not hang in

 after 1000 {
    after 1000 {set ::a 1; vwait ::b; puts "Finished"}
    vwait ::a
    set ::b 1
 }

(Note that it's not primarily about doing the above, although it could perfectly well happen in complicated situations, but about getting a vwait that is capable of doing it if required to do so.)

This meant the vwait would somehow have to save away what was currently in the C call stack, drop out into the outer event loop and process events there for a while, until the variable has been set, at which point everything would have to be put back so that processing could continue.

Also, this would have to be accomplished "within the box" — no fancy C compiler or OS features could be relied on, and recursive calls should still be supported.

Solution

Since there's no safe way to operate on the C stack from outside, the only way to get it out of the way is to explicitly unwind it — make every C function in there return to its caller. That's not too different from having an error propagate down the call stack, so:

: Yielding is done by returning with a new return code, TCL_SUSPEND, say.

When a yielding unwind is in progress, the interpreter result is a list. Each entity in a call chain receiving a TCL_SUSPEND result code from a call it made must do one of the following:

Append to the interp result a list element with sufficient information that it can later reconstruct the exact state it was in immediately before receiving the TCL_SUSPEND, and thus resume processing from where it was suspended. Then return to caller with TCL_SUSPEND code.
Throw an error "Cannot be suspended".
Catch the yield, if the entity is prepared to handle it. In this case, it must append the command (list of words after substitution) it called, so that we know from the continuation where to start when resuming.

Normally, a yield would proceed all the way down to the main event loop, where it gets handed off to bgerror (or the like), and that command is then responsible for arranging so that the suspended calls are resumed at a suitable point in the future.

A tricky detail is that there would now have to be two entry points for the C implementation of each Tcl command: one which is used when calling it normally, and one which is used when resuming it. I'll return to that matter below, but for now, let's look at what the script level behaviour is supposed to be.

Tcl API

I imagined the Tcl side of this could be handled with only two commands (although catch and return would probably figure as prominently as with any new control structure):

  suspend ?arg ...?
  resume continuation ?-code code? ?-level level? ?...? ?result?

suspend stops execution at some point and causes the call stack to unwind, whereas resume resumes it. If you do

  proc A {args} {list [suspend {*}$args] "in A"}
  catch {A foo bar} res

then the result will be TCL_SUSPEND and $res will be a continuation that contains the arguments of the suspend in its 0th element, i.e., it looks like

  {foo bar} {...

where the ... denotes the beginning of the serialization of the internal state of the call stack entity corresponding to the A procedure (yes, that is potentially quite a mouthful).

The resume arguments following the continuation are supposed to be the same as for return, but control how the original suspend behaves when resumed. Thus if one then does

  resume $res "Return from with"

then the result will be the list

  {Return from with} {in A}

because resume put back the A proc and the suspend command, the latter returns with TCL_OK (since there was no other -code specified), the list in A is called as

  list {Return from with} "in A"

and the result of that becomes the result of A as a whole, which is then also the result of the resume.

suspend services

(This subsection was added later.)

While not a necessary part of the proposal (i.e., a Tcl core equipped to support suspend and resume wouldn't need to know about it), it seems practical to introduce the convention that the first argument of suspend names a "service" that is requested from some TCL_SUSPEND handler higher up on the call stack. This gives suspend the appearance of an ensemble command, although the actual interpretation of these subcommands would happen rather far from the point of the suspend command.

Useful services when running under the event loop are:

suspend after ms: Pause for (at least) ms milliseconds, allowing other events to be processed in between.
suspend vwait varname ?varname ...?: Wait until one of the specified variables have been modified, then resume processing.

For those who prefer the LISPish counterpart call-cc, that can be provided as another service:

suspend call-cc cmdname ?arg ...?: Arrange for the current continuation (cc) to be built, and then call the cmdname command as
cmdname ?arg ...? continuation: in whatever context the handler providing that call-cc service lives in. The continuation is the current continuation up to that point (might thus be a partial continuation, for those who care about such details), so if you use resume as cmdname then this is effectively a no-op.

catch extension

(This subsection was added later.)

Since TCL_SUSPEND is a return code, all catches would by default catch it. This is in principle not a problem (even an advantage — see Intervention below), but it would cause problems for control structures coded in Tcl, since they would become incompatible with suspend and resume until updated to handle TCL_SUSPEND (this is comparable to the effects of introducing return -level in Tcl 8.5). However, one could ease that transition by modifying catch so that only a TCL_SUSPEND-aware form of the command actually catches this return code. (Preventing catch from catching everything has a precedent in the interp limit framework.)

As it turns out, extending catch serves also another purpose, namely to provide control sequences coded in Tcl with some way of determining the continuation items needed to get back into the script this catch was applied to. The new form of catch could then have the syntax

: catch script resvar optvar backvar

where the backvar would be set to a list of items such that catching a TCL_SUSPEND with

  catch $script res opt back

and then in the same procedure rethrowing it with

  return -code suspend [concat $res $back]

is equivalent to not catching it at all. This gives the following skeleton for a control structure that handle one suspend service "bar" but let everything else through:

 proc barable {script} {
    set code [catch $script res opt back]
    if {$code != $::TCL_SUSPEND} then {
       return -options $opt -level [expr {[dict get $opt -level]+1}] $res
    }
    switch -- [lindex $res 0 0] "bar" {
       # Handle "bar" [suspend]s...
    } default {
       lappend res {*}$back
       return -code suspend $res
    }
 }

Note however that this records the state of the control structure when at the catch, so if it wants to react to being suspended (cf. Intervention below), it needs to modify that state. In particular, the local context of the barable proc (local variables etc.) would be recorded in the final element of the back list, so that format needs to be documented.

The problem with trace

(This subsection to be filled in.)

C API

(Not my strong side, but one that has to be addressed.) Since the idea is mostly to give new meaning to things that can already happen, there isn't that much that has to be added on the C side, but one major thing is the prototype for the function that is called when a command is resumed; let's call this a resumeProc:

typedef int Tcl_ObjResumeProc(
        ClientData      clientData,
        Tcl_Interp     *interp,
        int             objc,
        Tcl_Obj *const  objv,
        int             resumeIdx,
        Tcl_Obj *const  resumeData  );

The idea here is that the resumeData should be the objv from applying Tcl_ListObjGetElements to the result caught with the TCL_SUSPEND, and resumeIdx is the index into this array of the element which is relevant for this command; the resumeProc will eventually call the resumeProc of the command it received TCL_SUSPEND from with the same resumeData but resumeIdx–1.

For practical convenience, it is probably also useful to have a library function for calling the resumeProc of a particular command, so Tcl_EvalObjEx (or whichever is the core one these days) would get the cousin

int Tcl_ResumeObjEx(interp, objPtr, flags, resumeIdx, resumeData)

The tricky part is of course that all these resumeProcs must be provided when the command is created. For a long time I thought that this meant nothing of this could be implemented before Tcl 9 anyway (hence there would be no rush to generate a string representation :-] of the suspend/resume idea), but now I'm not so sure — it's perfectly reasonable to have e.g. Tcl_CreateObjCommand create commands without a resumeProc (resumeProc==NULL), and instead introduce a new function

Tcl_Command Tcl_CreateSuspendableCommand(
     interp, cmdName, proc, clientData, deleteProc, resumeProc
)

for creating commands that do have it. Calling Tcl_ResumeObjEx for a command that doesn't have a resumeProc won't work, but we can simply report that as an error, so it's no big deal (at the C level).

However, rather than reporting such errors at the resuming stage, it would be better to detect them at the suspend stage. In principle, that could be done by having Tcl_Eval* verify that any command returning a TCL_SUSPEND also has a resumeProc, or else convert that TCL_SUSPEND into an appropriate error. As far as I can tell, that would be the only potential incompatibility of this scheme against current Tcl.

Comparison of coroutine/yield and suspend/resume

Roughly, suspend corresponds to yield, whereas resume is more like the coroutine-cmds; there is no explicit coroutine-like command needed in the suspend/resume approach, since one is implicitly provided by the main event loop (when that is running).

Existence

Obviously, a big difference is that an implementation of coroutine/yield exists, whereas suspend/resume is at best a sketch.

Model

coroutine/yield is, as far as I can tell, modelled after the concept of a generator: a subroutine than can return several times. suspend/resume rather follows the continuation model: the rough idea of "the program from this point on" is given tangible existence. It is different from the LISPish call-cc in that the continuation is not given the form of an anonymous function, but that's natural given the more imperative nature of Tcl, and it is also different in that it is not the part of the program that calls suspend which gets to handle the continuation, but rather some generic handler at the top level.

Storage

Since suspend starts putting data in the interpreter result, everything has to be put under a Tcl_Obj wrapper. This is potentially a slow operation (but probably not so slow that it becomes a major issue when suspending the handler for an event). By contrast, coroutine/yield hides everything under the opaque handle of a command.

Giving continuations the form of a Tcl_Obj brings all the usual EIAS advantages: can pass by value and store in data structures, can save on file or hand over to another thread, one can even resume the same continuation several times (thus admitting a native implementation of oracles for nondeterministic automata)! The cost is that one actually has to implement (for every resumable command!) a relevant freeIntRepProc, dupIntRepProc, updateStringProc, and setFromAnyProc. This is just a Simple Matter Of Programming compared to coroutine/yield, since the potentially nontrivial task of isolating the internal state of the generator from the rest of the interpreter has already been done (introduction by NRE of execEnv) and all that remains is to serialize/deserialize this state. Designing a serialization format capable of expressing the necessary data structures is obviously possible. Designing it so that it can only express sensible instances of the necessary data structures is perhaps less easy, but this is not technically required for EIAS.

There is OTOH no principal problem in introducing explicit commands that serialize and deserialize a coroutine-command in the coroutine/yield approach (it just hasn't been done yet), and this could then be used to transport coroutines across thread boundaries, but there is probably a practical problem in that cooperation (serialization/deserialisation of clientData) would be required from anything that creates an NRE callback.

Emulation

coroutine and yield can be emulated using suspend/resume, as follows (rough implementation):

 interp alias {} yield {} suspend yield

 proc coroutine {cmd args} {
    if {[catch {{*}$args} res opt] != $::TCL_SUSPEND} then {
       return -options $opt $res
    }
    switch -- [lindex $res 0 0] "yield" {
       interp alias {} $cmd {} call-again $cmd $res
       return [lindex $res 0 1]
    }
 }

 proc call-again {cmd continuation value} {
    rename $cmd ""
    coroutine $cmd resume $continuation $value
 }

With the extended catch described above, that should be

 proc coroutine {cmd args} {
   if {[catch {{*}$args} res opt back] != $::TCL_SUSPEND} then {
      return -options $opt -level [expr {[dict get $opt -level]+1}] $res
   }
   switch -- [lindex $res 0 0] "yield" {
      # This kind of [suspend] is handled here.
      interp alias {} $cmd {} call-again $cmd $res
      return [lindex $res 0 1]
   } default {
      # Other kinds get to suspend also [coroutine],
      # backing up to the [catch] above.
      lappend res {*}$back
      return -code suspend $res
   }
 }

to allow coroutines to suspend themselves (and their callers) for other reasons than yielding.

I don't see how the converse would be possible (but since NRE is more than just coroutine, that needn't be a problem). What NEM has done on coroutines for event-based programming suffices for the motivating example, but suspend and resume are capable of more than that.

Intervention

Since a TCL_SUSPEND is a return code and can be caught, it allows surrounding control structures to intervene and/or react to being suspended. As I understand it, NRE provides nothing in this area.

As an example of intervention, consider control structures that exist not to do flow control, but to guard the utilization of some limited resource. One such case is the transaction subcommand of a TDBC handle:

 proc foo {db expr} {
   ...
   $db transaction {
      ...
      bar $expr $value
      ...
   }
   ...
 }
 proc bar {expr value} {
   ...
   if $expr then {set value [yield $value]}
   ...
 }
 coroutine baz foo $db $expr

Chances are that the [$db transaction] above will count as recursion with respect to the NRE and thus cause yield to error out, with a [$db rollback] as consequence, but if it does not then the database could stay locked for a rather long time, without a clear offender. By contrast, if the yield had been a suspend instead then [$db transaction] would be explicitly informed about what is going on, and thus given a chance to react sensibly. Doing an immediate [$db rollback] and changing the TCL_SUSPEND into a TCL_ERROR is the most elementary reaction (and perhaps the only one directly supported by the TDBC interfaces), but it might often be more practical to capture the operations made so far in some kind of diff, and attempt to reapply them when the transaction is resumed. In that case, the TCL_SUSPEND should be passed on.

A complication for control structures implemented in Tcl is that they need to create some representation of their own internal state when passing on a TCL_SUSPEND. If a third command besides suspend and resume is needed, then this will probably be why. On the other hand, if the serialization format used for the internal state of a proc is documented, then all sorts of programming tricks (e.g. disconnecting an upvared variable) can be implemented as suspending execution and having some detail in the state of some surrounding procedure modified…

Captured e-mail discussions

MS Thanks.

There is one problem with this approach: it breaks every extension that calls for an evaluation, as it doesn't know about TCL_SUSPEND nor what to do with it. Some of those are under our control (Tk for example), but unknown quantities are not.

A second (or is it the same?) problem is that extensions may well be using their own return codes ... a new return code is not something that we can do until Tcl9.

Lars H: Hadn't thought of that one, but OTOH I was mostly thinking Tcl9 anyway (until recently).

Realistically though, if some TCT member had come up with an idea that required usurping another return code, then I doubt there would have been much opposition against doing so within the 8.x series. Also, that noone else reported bug#647307 is perhaps an indication that nonstandard return codes aren't that heavily used. ;-)

MS A third problem with the special return code is that catch would block it; the current code can yield in arbitrarily nested catch ranges.

Lars H: That is not a bug; it is a feature (as explained above)!

MS The thing I implemented is 100% compatible, extensions that do not adapt keep on working but may block suspending. There is an api for extenders to adapt.

Lars H: No big difference there, I'd say:

As long as you don't use suspend (or emit the return code explicitly), everything works as before. Hence TCL_SUSPEND can be chosen so that all existing programs continue to work.
Extensions that call Tcl_Eval* would have an API that they could use to adapt to handle TCL_SUSPEND and resume.

MS This is also one of the reasons why you have to "declare" a coroutine: you have to prepare the infrastucture to be able to suspend a "thread" (?), which may or may not mean returning to the top level.

Your ideas would still fail to produce coroutines that can migrate accross threads, at least without big changes in the core: the problem is that the same interp cannot be run in different threads. The same problem as the ones in head.

Lars H: Thread migration would be done via serialization, as explained above. Reconstructing the state of a proc from string rep in another thread may seem like a hazardous task, but probably no more so than in the original interp. In particular, I suspect one must be able to handle changes in the bytecode for the procedure one is resuming, since redefenitions of other command may well have triggered a recompilation, can they not?

MS What I implemented relies on the NRE execution model: (newer, adapted) C code never calls an eval, it rather queues a callback with enough info do any post-processing or cleanup it needs to do after the evaluation, schedules the new evaluation and returns TCL_OK. There is a trampoline that runs the evaluation and then the callbacks. If a C function actually does call an eval using the regular API, trying to suspend before it returns will raise an error.

That means that no C function is ever suspended: it was implemented as a two-part function to begin with. Even though your email got truncated, I imagine this is not too different in spirit to the two-entry-point system you were about to describe?

Lars H: Slightly different in spirit, but perhaps not in nature. In order to make a typical resumeProc for Tcl_CreateSuspendableCommand, one should be able to just take the corresponding proc and add "spaghetti" for jumping to the place where resuming requires one to be — provided that one has the "stomach" for such violations of Structured Programming dogma. By contrast, the NRE API of callbacks one can add rather suggests that the proc should be transformed into an automaton. It is possible that any resumeProc obtained through adding spaghetti will also be possible to operate as an automaton through the Tcl_NRAddCallback interface, in which case the two would indeed not be different in nature, but I haven't explored the design in that much detail.

DKF: The whole "Structured Programming dogma" stuff was about constraining the form of the code that people wrote, not what it was mechanically transformed into. (Otherwise there'd be people very concerned about the fact that while loops are changed into something involving goto...) Hence I advise not worrying about the form of the code after transformation.

Of more concern when using your scheme (as opposed to Miguel's) is that it's not clear to me that it can support things like trace correctly. (For example, I've seen code that used a variable unset trace set on a caller's variable to determine when to destroy some other resource; sort of a RAII system on the cheap.)

Lars H: Indeed, that subsection of the "spec" is still ToBeWritten. The approach I think is reasonable is that unset etc. traces don't fire on suspend — the variable isn't really "gone", it has just dropped off the radar (as in Miguel's scheme) — but one might, for precisely this type of situation, need new trace operations to track suspends and resumes. Does Miguel's scheme currently allow this techinique to work, though; do unset traces fire when the execEnv they live in is deleted? I think it has already been established that it does not provide a way of reacting to yields.

DKF: Yes (MS removed bug description [L1 ], now fixed in HEAD)

% proc foo {} {set v 1; trace add variable v {write unset} bar; yield; set v 2; yield; set v 3}
% proc bar args {puts $args};namespace path tcl::unsupported;coroutine a foo
% a
v {} write
% a
v {} write
v {} unset
3
% a
ambiguous command name "a": after append apply array atProcExit auto_execok auto_import auto_load auto_load_index auto_qualify
% coroutine a foo
% a
v {} write
% rename a {}
v {} unset
%

MS: for this task you may also like atProcExit.

DKF: Only if you can both add extra things to the "at procedure exit" call or remove them. Without that you have problems if you decide that your resource really ought to not be deleted on exit. (Yeah, I ought to investigate properly...)

MS: (Nay, do not investigate ... help design) atProcExit does not allow that ... yet; maybe this weekend.

Zarutian 7. september 2008: I recall I once were fiddling with something similar to this but just implemented in pure Tcl without any change in the core or any libraries. Since are many a year. I only remember that I used traces on procedures heavily (mostly enterStep and leaveStep) to capture the continuation from the callstack which then is saved somewhere. When the continuation is continued then the callstack is built up again without side-effects (all called procedures in the continuation temporarly become memoized) and execution resumes. Sadly it could only save callstacks that only had calls to pure Tcl procedures. Where that (unfinished) code is I dont know.

DKF: It just occurred to me that with NRE in place it should be possible to do goto, since the main objection (the recursive nature of Tcl's evaluator) will have been squelched. OTOH, just because it is possible doesn't make it a good idea; I suspect that I'll never support putting goto in fully since it's always better to come up with a custom control structure that expresses what you really want to do.

Category Suggestions

Category Control Structure