Version 18 of Jim References

Updated 2006-12-05 13:31:44

SS 18Mar2005:

This page is an introduction to the usage of the Jim references system.

Rationale

Why did Jim add references to the language? These are the main reasons, in decreasing order of importance:

  • Tcl already uses references. OOP systems often return names of objects, many libraries return handles. They are an idiomatic part of the language, to have support for references directly inside the language is useful.
  • Putting a reference system inside the language makes possible to implement Garbage Collection in a general way.
  • lambda is not very useful if it is not garbage collected.
  • Closures (and Jim closures) are almost useless without garbage collection.
  • The ability to build linked data structures sometimes can be very important in order to model real problems.

Design

Once I decided to add references to Jim, the question was in what form. My goal was to get a very simple semantic, don't violate in any way the Tcl semantic (i.e. be able to collect references mixed in strings after interpolation, shimmering, ...), allow for fast enough garbage collection, don't require extensions that need to take objects in private data structures to export objects in some way (and at the same time, making fragile references not an issue).

I'll not enter into implementation details, but in order to do so Jim is able to reach every object that is currently alive inside the script, even if this object is not reachable starting by the root nodes of the graph of the objects in a program. I.e. Jim is able to reach objects that are not reachable starting from globals and local variables. This feature is exposed at script level via the [debug objects] command.

How they work

References are just names with an associated string. Given the reference it's possible to obtain the associated string, or to set a new one in place of the old.

In order to create a new reference, the ref command is used. It has the following signature:

 ref string tag ?finalizer?

The string is the initial value associated to the reference to create. Can be any kind of string (or object, if you like more the implementation point of view). The tag will be explained in a moment. The finalizer is the name of a procedure that will be called just after the reference was destroyed with two arguments: the name of the reference destroyed, and its content.

An example of reference creation:

 . set r [ref "My little string" test]
 <reference.<test___>.00000000000000000001>

What is returned is indeed the name of the reference. References are all the same length, and can't contain spaces (this way it's possible to skip many kind of objects in the GC scan stage). As you can see the tag argument "test" appears in the reference name. And actually this is the only use for the tag command, it is a little description of the reference content. When debugging programs or in interactive testing it's handy to have a clue about what a reference may contain, or by its creator, just looking at his name. There is no semantical meaning in the tag, it's just a programmer help. The tag can't be longer than 7 characters in order to make references memory usage and creation/destroy time very little.

Now, the reference containing "My little string" is contained in the variable 'r'. Using getref it's possible to obtain the string a reference contains:

 . set r [ref "My little string" test]
 <reference.<test___>.00000000000000000002>
 . getref $r
 My little string

setref will instead set a new string for the specified reference:

 . setref $r "New String"
 New String
 . getref $r
 New String

That's it! Now... why is this special? You can already do it with a global array in Tcl. The unique feature of Jim references is indeed that they are garbage collected. Once a given reference is no longer found in any object, it is automatically destroyed and the finalizer, if there is one, called.

Garbage Collection and finalizers

Jim takes two counters:

  • The number of seconds elapsed since the last collection.
  • The number of references created since the last collection.

Every time a new reference is created, if at least one of these two counters is greater than a given value, a garbage collection cycle is performed. The user may also force garbage collection if working with huge strings, or just for testing, using the collect command.

Using the collect command we can try the effect of the finalizer.

First of all, a reference is created, setting as finalzer procedure myfinalizer:

 . set r [ref foobar test myfinalizer]
 <reference.<test___>.00000000000000000004>

We need to create this procedure of course. It just a one liner:

 . proc myfinalizer {x y} {puts "Finalizer called with $x,$y"}

Now, let's call collect to check what happens:

 . collect
 0

It returns zero and the finalizer is not called. Because there is nothing to collect, there is only one reference and it is referenced by the variable r. The collect command returns the number of references collected. So let's lose the reference we have by setting r to the null string, then try again to call collect:

 . set r {}
 . collect
 Finalizer called with <reference.<test___>.00000000000,foobar
 1

This time the reference was destroyed, and the finalizer called with its name, and the content, as arguments.

The implementation of lambda

Given a system like this, you can create a garbage collected lambda in Jim itself with a few lines of code:

 proc lambda {arglist args} {
     set name [ref {} function lambdaFinalizer]
     uplevel 1 [list proc $name $arglist {*}$args]
     return $name
 }
 proc lambdaFinalizer {name val} {
     rename $name {}
 }

Indeed this is not an example! This is the real implementation of lambda in Jim. This is how it works: it creates a reference with an empty string as content, using as tag "function", and LambdaFinalizer as finalizer procedure. Then a normal Tcl procedure having with name the reference is created, and its name returned. Done!

Once every reference to the procedure is lost, the finalizer will be called, and it will just use rename to delete the procedure. Note that this is possible because Jim does not scan variables or command names for references.

Jim references will also be the base of the Jim Object System (JOS).

Open issues

Currently Jim lacks two commands that will be implemented in short time. This are [info references] that will return all the references currently defined, and [finalizer] that will be used to query/modify the finalizer of an already created reference.

Space reserved for discussion

That's all for now, but please feel free to ask questions if something is not clear.


This is - a w e s o m e - I've pulled my hairs out trying to embed Vlerq, a project I'm working on, in Tcl. Found workable solutions in the end, but the above will make it sooo much simpler. Its requirements are simple: referring to objects not created in Tcl, for the sole purpose of holding on to them, passing them back as opaque handle, and cleaning up when the last reference to these objects goes away.

In Python, Ruby, and C++, this is trivial. In Tcl it's extremely awkward. In jim it might just work perfectly!

Thank you for creating a fascinating next-generation version of Tcl -jcw

SS Thank you for your comment!

jcw - A brief question about references: is Jim's GC like Boehm's GC? It seems to me that they use the same principle: traverse everything, assume a ref exists if a value has the right form & content (Jim of course has far fewer false positives).

SS Yes, it is the same principle but to an higher level. That's how it works: In Jim allocation of every Jim_Obj is handled per-interpreter, every Jim_Obj has two pointers more, when an object is created (either allocated from scratch or taken from the free list of objects), it is linked to a doubly linked list that has the head into the interp structure. So in every moment of the execution of a program there is a list that you can run and find every object that is currently in use.

This way it's possible to go through the list of objects, and search inside the object string representation if something looks like a reference. This is costly enough if there are a lot of objects, and requires to generate string representations for all the objects, so I tried to refine with the following changes: every objType structure adds a field, a flag that tells you if a given object can contain or not a valid reference. Because references are in the form <reference<....>00000 ...> integers, doubles, and other objects where a '<' char can't appear at all are not scanned for references, nor their string representations are generated. Notably even lists and dicts are not scanned, because references format was designed to don't contain spaces: this way I'm sure that a list (so a dict) can't contain references that the individual elements of this list don't contain: I scan all the objects, so I'm sure I'll encounter the elements composing the list. The next optimization is that references are long enough strings that many objects are not scanned because they have a string rep with length less than the length of a reference.

The final optimization is, of course, an object of type reference ;) in such a case the collector just takes the refernece ID from the internal repr.

A key difference between the Boehm's GC and Jim's GC is that the first can start to collect from the root nodes of the graph, i.e. the stack, global vars, and so on. Jim can't... instead the whole space of objects is scanned, in order to be sure to reach even objects that are taken by extensions in some private data structure, without to require that this extensions provide some kind of way to "publish" the fact that they have private references to objects.

But this has a price... Jim is not able to collect circular references, so circular lists or complex graphs are not collected. Fortunately in the pratice this kind of data structures are not very used. In the future will be probably possible to add a cycles collector that runs less often than the real one, but for now I think that it's much more valuable the fact that extensions don't have to care about the GC and still it works.

Sorry for the very long comment!

Setok This is actually quite interesting and has good potential to work in practise. There is one situation where, as far as I can tell, the garbage collector will give a false result. That is if, for one reason or other, the ref string would be split into two separate variables (to later be joined). I can't think of any actual real-life situation in which this would happen, but it's probably good for the user to keep in mind. Additionally passing references to other interpreters or over the net results in the same, as it would anywhere else really. It might be useful to have retain-release functionality for that.


Category Jim