Jim References

SS 18Mar2005:

This page is an introduction to the usage of the Jim references system.

Rationale

Why did Jim add references to the language? These are the main reasons, in decreasing order of importance:

Tcl already uses references. OOP systems often return names of objects, many libraries return handles. They are an idiomatic part of the language, to have support for references directly inside the language is useful.
Putting a reference system inside the language makes possible to implement Garbage Collection in a general way.
lambda is not very useful if it is not garbage collected.
Closures (and Jim closures) are almost useless without garbage collection.
The ability to build linked data structures sometimes can be very important in order to model real problems.

Design

Once I decided to add references to Jim, the question was in what form. My goal was to get a very simple semantic, don't violate in any way the Tcl semantic (i.e. be able to collect references mixed in strings after interpolation, shimmering, ...), allow for fast enough garbage collection, don't require extensions that need to take objects in private data structures to export objects in some way (and at the same time, making fragile references not an issue).

I'll not enter into implementation details, but in order to do so Jim is able to reach every object that is currently alive inside the script, even if this object is not reachable starting by the root nodes of the graph of the objects in a program. I.e. Jim is able to reach objects that are not reachable starting from globals and local variables. This feature is exposed at script level via the [debug objects] command.

How they work

References are just names with an associated string. Given the reference it's possible to obtain the associated string, or to set a new one in place of the old.

In order to create a new reference, the ref command is used. It has the following signature:

 ref string tag ?finalizer?

The string is the initial value associated to the reference to create. Can be any kind of string (or object, if you like more the implementation point of view). The tag will be explained in a moment. The finalizer is the name of a procedure that will be called just after the reference was destroyed with two arguments: the name of the reference destroyed, and its content.

An example of reference creation:

 . set r [ref "My little string" test]
 <reference.<test___>.00000000000000000001>

What is returned is indeed the name of the reference. References are all the same length, and can't contain spaces (this way it's possible to skip many kind of objects in the GC scan stage). As you can see the tag argument "test" appears in the reference name. And actually this is the only use for the tag command, it is a little description of the reference content. When debugging programs or in interactive testing it's handy to have a clue about what a reference may contain, or by its creator, just looking at his name. There is no semantical meaning in the tag, it's just a programmer help. The tag can't be longer than 7 characters in order to make references memory usage and creation/destroy time very little.

Now, the reference containing "My little string" is contained in the variable 'r'. Using getref it's possible to obtain the string a reference contains:

 . set r [ref "My little string" test]
 <reference.<test___>.00000000000000000002>
 . getref $r
 My little string

setref will instead set a new string for the specified reference:

 . setref $r "New String"
 New String
 . getref $r
 New String

That's it! Now... why is this special? You can already do it with a global array in Tcl. The unique feature of Jim references is indeed that they are garbage collected. Once a given reference is no longer found in any object, it is automatically destroyed and the finalizer, if there is one, called.

Garbage Collection and finalizers

Jim takes two counters:

The number of seconds elapsed since the last collection.
The number of references created since the last collection.

Every time a new reference is created, if at least one of these two counters is greater than a given value, a garbage collection cycle is performed. The user may also force garbage collection if working with huge strings, or just for testing, using the collect command.

Using the collect command we can try the effect of the finalizer.

First of all, a reference is created, setting as finalzer procedure myfinalizer:

 . set r [ref foobar test myfinalizer]
 <reference.<test___>.00000000000000000004>

We need to create this procedure of course. It just a one liner:

 . proc myfinalizer {x y} {puts "Finalizer called with $x,$y"}

Now, let's call collect to check what happens:

 . collect
 0

It returns zero and the finalizer is not called. Because there is nothing to collect, there is only one reference and it is referenced by the variable r. The collect command returns the number of references collected. So let's lose the reference we have by setting r to the null string, then try again to call collect:

 . set r {}
 . collect
 Finalizer called with <reference.<test___>.00000000000,foobar
 1

This time the reference was destroyed, and the finalizer called with its name, and the content, as arguments.

The implementation of lambda

Given a system like this, you can create a garbage collected lambda in Jim itself with a few lines of code:

 proc lambda {arglist args} {
     set name [ref {} function lambdaFinalizer]
     uplevel 1 [list proc $name $arglist {*}$args]
     return $name
 }
 proc lambdaFinalizer {name val} {
     rename $name {}
 }

Indeed this is not an example! This is the real implementation of lambda in Jim. This is how it works: it creates a reference with an empty string as content, using as tag "function", and LambdaFinalizer as finalizer procedure. Then a normal Tcl procedure having with name the reference is created, and its name returned. Done!

Once every reference to the procedure is lost, the finalizer will be called, and it will just use rename to delete the procedure. Note that this is possible because Jim does not scan variables or command names for references.

Jim references will also be the base of the Jim Object System (JOS).

Open issues

Currently Jim lacks two commands that will be implemented in short time. This are [info references] that will return all the references currently defined, and [finalizer] that will be used to query/modify the finalizer of an already created reference.

Steve Bennett 19 Oct 2010 The latest version of Jim now has the [info references] command to list all references, and the [finalize] command to query/modify the finalizer for a reference.

Space reserved for discussion

That's all for now, but please feel free to ask questions if something is not clear.

This is - a w e s o m e - I've pulled my hairs out trying to embed Vlerq, a project I'm working on, in Tcl. Found workable solutions in the end, but the above will make it sooo much simpler. Its requirements are simple: referring to objects not created in Tcl, for the sole purpose of holding on to them, passing them back as opaque handle, and cleaning up when the last reference to these objects goes away.

In Python, Ruby, and C++, this is trivial. In Tcl it's extremely awkward. In jim it might just work perfectly!

Thank you for creating a fascinating next-generation version of Tcl -jcw

SS Thank you for your comment!

jcw - A brief question about references: is Jim's GC like Boehm's GC? It seems to me that they use the same principle: traverse everything, assume a ref exists if a value has the right form & content (Jim of course has far fewer false positives).

SS Yes, it is the same principle but to an higher level. That's how it works: In Jim allocation of every Jim_Obj is handled per-interpreter, every Jim_Obj has two pointers more, when an object is created (either allocated from scratch or taken from the free list of objects), it is linked to a doubly linked list that has the head into the interp structure. So in every moment of the execution of a program there is a list that you can run and find every object that is currently in use.

This way it's possible to go through the list of objects, and search inside the object string representation if something looks like a reference. This is costly enough if there are a lot of objects, and requires to generate string representations for all the objects, so I tried to refine with the following changes: every objType structure adds a field, a flag that tells you if a given object can contain or not a valid reference. Because references are in the form <reference<....>00000 ...> integers, doubles, and other objects where a '<' char can't appear at all are not scanned for references, nor their string representations are generated. Notably even lists and dicts are not scanned, because references format was designed to don't contain spaces: this way I'm sure that a list (so a dict) can't contain references that the individual elements of this list don't contain: I scan all the objects, so I'm sure I'll encounter the elements composing the list. The next optimization is that references are long enough strings that many objects are not scanned because they have a string rep with length less than the length of a reference.

The final optimization is, of course, an object of type reference ;) in such a case the collector just takes the reference ID from the internal repr.

A key difference between the Boehm's GC and Jim's GC is that the first can start to collect from the root nodes of the graph, i.e. the stack, global vars, and so on. Jim can't... instead the whole space of objects is scanned, in order to be sure to reach even objects that are taken by extensions in some private data structure, without to require that this extensions provide some kind of way to "publish" the fact that they have private references to objects.

But this has a price... Jim is not able to collect circular references, so circular lists or complex graphs are not collected. Fortunately in the practice this kind of data structures are not very used. In the future will be probably possible to add a cycles collector that runs less often than the real one, but for now I think that it's much more valuable the fact that extensions don't have to care about the GC and still it works.

Sorry for the very long comment!

Setok This is actually quite interesting and has good potential to work in practise. There is one situation where, as far as I can tell, the garbage collector will give a false result. That is if, for one reason or other, the ref string would be split into two separate variables (to later be joined). I can't think of any actual real-life situation in which this would happen, but it's probably good for the user to keep in mind. Additionally passing references to other interpreters or over the net results in the same, as it would anywhere else really. It might be useful to have retain-release functionality for that.

AMG: If my understanding is correct, retain/release functionality already exists. Just make a list of references you want to retain, and they will be retained! Of course, if you lose the list, the references become garbage to be collected, even if your program has a way to reconstruct them that evades the GC's notice.

AMG, update: I am curious about something... above it's said that references are all the same length and contain no spaces in order to make it easy for the GC to skip many objects and focus on the references. If a list of references is created, even though the list isn't a reference and will be skipped, the contained objects will be scanned. But what if the list of references loses its list representation yet remains a valid list? The contained objects would go away, instead becoming nothing more than substrings of the container. For example, maybe some spaces get [append]ed to it. The list commands will still know what to do with it, but between the [append] and an [lindex], the GC won't notice any of the references contained within the reference list. Right? How does Jim handle this?

AMG, another update: I did a little test, and it seems Jim garbage collection no longer works the way described in this page. It's now more conservative about finalizing references.

. proc final {args} {puts [concat final $args]}
final
. set r >>[ref foobar test final]<<
>><reference.<test___>.00000000000000000008><<
. collect
0

Here, $r is definitely not a valid reference, but it contains a substring which is a valid reference. This is enough to prevent [collect] from calling [final].

Steve Bennett, well...I haven't read this page in a while. Substrings are definitely considered when checking for valid references. Nonetheless, as you point out above, it is possible to defeat the conservative garbage collector as shown below.

. set a [list [lambda a {}] [lambda b {}]]
<reference.<functio>.00000000000000000000> <reference.<functio>.00000000000000000002>
. [lindex $a 0] x
. regexp $a x
0
. collect
2
. [lindex $a 0] x
invalid command name "<reference.<functio>.00000000000000000000>"
. puts $a
<reference.<functio>.00000000000000000000> <reference.<functio>.00000000000000000002>

As you see, by forcing the list to be treated as a regular expression it is marked as not being able to contain references. Of course this code doesn't make any sense!

AMG: Here's the big question. Do you consider this to be a bug or a design constraint? As you say, the exploit code is nonsensical and not worth supporting, so I'm definitely willing to call this a constraint. But if that's the case, why is it worth supporting the substring case I show? It also doesn't make sense. Where do you draw the line?

Also, here's a simpler exploit. There's no need for [list]:

. set a [lambda {} {}]
<reference.<functio>.00000000000000000000>
. $a
. regexp $a x
0
. collect
1
. $a
invalid command name "<reference.<functio>.00000000000000000000>"

Steve Bennett: Hmm, hard to say. There are probably legitimate cases where a "container" becomes a pure string for a period of time (hence the substring search). The whole point of not searching every object is that we don't need to convert back to a string and search objects which can't contain references. It's a performance optimisation. Is it worth losing that (huge) optimisation to address these non-sensical cases? If are clever enough to find a scheme which addresses both, that would be idea.

AMG: I can't think of one. I'm working on a reference system, by the way, and I'm looking to Jim for inspiration. For now, I'm inclined to ignore substring references, counting only reference intreps and whole-string references. This loses out on tracking references embedded in list-/dict-like strings lacking the "right" intrep, so my idea handles fewer cases than Jim. My reference notation is harder to parse, and I only want references to count if they appear as a whole word, so that means checking a string for references basically requires converting it to a list of reference values. That's a tremendous cost, and it's unconscionable for the garbage collector to induce shimmering.

Then again, I do have an optimization which might make this list/reference conversion feasible. Rather than periodically collect garbage by scanning everything, I do reference counting, decrementing the referent object's refcount when the referring object is destroyed due to its refcount dropping below one. This means there's no need to scan all extant objects. This lets the interpreter spend a bit more time analyzing each object, since it will do this only once per object, at the end of each object's life. But that runs afoul of in-place modification of unshared objects... Well, crud. That potentially breaks even whole-string references, leaving only reference intreps.

By the way, for my system the "object" being referred to isn't a value, it's a variable.

Steve Bennett: Sure. I'll be interested to take a look at what you come up with.

AMG: Here's an idea. Add a flag to each object that indicates whether it could possibly contain a reference. Only scan objects marked with this flag. (Maybe also maintain a list of pointers to objects known to be flagged so as to further optimize scans. This list doesn't need to be ordered, and the key is the value; in other words, it's a set and can be efficiently implemented as a value-less hash table.)

The return value of [ref] is a flagged object. An object that involves the concatenation of a flagged object (such as [ref]'s return value) is likewise flagged. A list or dict containing such an object is not flagged, but if it shimmers to string with no list or dict intrep, its string value was produced via concatenation and therefore gets flagged. If it later shimmers back to list or dict, the flag is cleared, but one of the element objects will have the flag set.

If unshared, the string could have lost its reference element due to an in-place modification. This would not be noticed until it shimmers to list or dict, but there's no real harm in an object being flagged as possibly having a reference if it doesn't actually have a reference. But watch out! An in-place modification could add a reference to a string. To handle this, there would need to be a function that extension code could call to check for references in a string or substring that it's modifying, then flag the object if any potentials are found. What a chore...

Now you will never skip objects based on type, and you will minimize the need to scan for substrings.

There's one drawback. If a reference object is copied outside of normal channels, the copy won't have its reference flag set. Here's a (theoretical) example:

. ref foobar test
<reference.<test___>.00000000000000000000>
. set r <reference.<test___>.00000000000000000000>
<reference.<test___>.00000000000000000000>
. collect
1
. getref $r
invalid reference id "<reference.<test___>.00000000000000000000>"

In this example, the object is copied by means of the user reading the screen and typing it back in again.

To handle such a case (if you want to), it would be necessary to also scan newly created string objects for possible references. I'm not sure you're going to want to do that with strings coming from files and network sockets, even though a pathological program could use such things to send itself references, bypassing the flag mechanism described above. Anyway, this check probably shouldn't care if the references name currently extant reference values, only if the reference signature is observed. This is done to avoid trouble with a string being manually set to a predictable reference value before creating that reference.

AMG: I have found more cases to handle, which I document in my paper to be presented at Tcl2012. For now I'm adopting a garbage collection scheme that is a hybrid of Jim, Tcl-style reference counting, and mark-and-sweep, but with infrequent, targeted execution and sharply limited scope so as to hopefully obtain good performance for non-pathological programs. At this point it's just a proposal, so I have no metrics to share.

Category Jim