Tcl9 and annotated strings

KBK writes: Donal Fellows and I had the following exchange in comp.lang.tcl [L1 ]:

Me:

I have been thinking a lot about instance management in Tcl in recent months; alas, my thoughts are definitely not ready to convert into TIPs.

Donal:

Care to put any of that thought into words? I've been thinking about this for quite a while too, and I have a feeling that if you could somehow attach to strings a notion of what objects are "contained" within that string, you could do something reasonably smart with operations like string-concatenation and [eval].

I guess that I owe him a few words about where my thoughts have led me. Note that these thoughts are solely my own; I am not in any way speaking for the TCT.


Why is instance management a problem?

The trouble with the internal representations that attach to Tcl_Obj values is that they're too easy to lose. If we simply use the reference count of the Tcl_Obj to manage whether a reference is "live", we run into several problems that cause us to collect garbage prematurely.

1. shimmering. The interpreter may find it necessary to convert the object in question to another internal representation. This problem is actually not a major cause of premature garbage collection in systems like Jacl and TclBlend. The typical string representation in a Tcl_Obj that's used as an object instance is something that can't be mistaken for, nor converted to, anything else. Very occasionally, one can get bitten by the duality between strings and lists, causing the internal representation of the object to be converted to a Tcl list. This problem could be avoided for handles and for opaque objects by choosing a string representation that is not a well-formed list, for example one that contains an unbalanced brace.

2. Interpolation into strings. This issue is a much commoner cause of premature garbage collection. There are a number of reasons to embed an object reference into a string. One common one is to create callbacks:

    set foo [createAFunkyObject]; # foo's internal representation contains
                                  # an object pointer
    button .b -text "Funky" -command "$foo doSomething"

When the variable foo goes out of scope, it is likely that the callback string of .b will be the only reference to the object: since that string has only a string representation, there is no longer a reference to the object. The object immediately becomes a candidate for garbage collection. When the button is invoked, the object no longer exists.

This problem can be avoided in some cases by creating callbacks that are pure lists. The interpreter has special code that bypasses the parser when a pure list is evaluated, so the Tcl_Obj containing the object in question can be kept alive. Alas, this scheme all to soon breaks down, because not every possible callback can be represented in that way. Moreover, if garbage-collected object systems, become popular, this technique will be well-nigh impossible to explain to newcomers. Furthermore, even under the best of circumstances, it requires the intepreter to maintain a global cache mapping object names to object values, so that an object name that has been interpolated into a string by means of a command like:

    bind $w <Button-1> [list doSomething $foo %x %y]

can have its internal representation recovered.


The string representation is stupid.

As Donal points out [L2 ]:

I have a feeling that if you could somehow attach to strings a notion of what objects are "contained" within that string, you could do something reasonably smart with operations like string-concatenation and [eval].

The problem is that the Tcl_Obj has only one idea of a string representation: a pointer to a UTF-8 character string. The notion that "everything is a string" has served us well for as long as Tcl has existed. Nevertheless, its elegant simplicity comes at a price: we have no way of attaching additional information to substrings of a string.


A smarter string representation.

For nearly as long as Tcl has existed, I (KBK) have wanted a more "intelligent" string representation. This desire originally came less from wanting to preserve object information and more from wanting to achieve better performance on common operations such as concatenation and substring extraction. All too often, the current implementation requires that strings be copied in memory.

I contend that the current concept of "string representation" is too simple to address both instance management and performance on large strings. In fact, even though the text widget in Tk conceptually operates on a single large string of characters, from the very beginning it has used an internal representation of that string based on B-trees. Tcl strings with embedded object references need something similar.

This discussion reminds me of a text editor that I helped develop in the late 1970's. In the editor, a file was represented as an B-tree of lines. Each node in the tree contained counts of lines and characters in its subtree; nodes could also be paged out to external memory (remember that typical systems of the time had at most 64k bytes of RAM!). Nodes could also be annotated: a node could be labeled with additional data. One thing that the annotations were used for was script processing: there was a little language that controlled the editor. When source code for an editor macro was processed, the B-tree representing the macro string was annotated with file name and line number; this structure was used to produce error messages and stack traces.

Nowadays, data structures have advanced a little bit, and memory is cheap; paging to external storage is not a major concern. Still, representing Tcl strings with some sort of search tree with order statistics is a tempting idea. Not only would it solve the perennial problem of locating the source line after an error in a Tcl script, but it also provides an ideal framework for keeping track of Donal's notion of what objects are "contained" within the string.


How do we get there from here?

There are many details that need to be worked out; this idea must be a candidate for Tcl 9 because it will unquestionably break extensions. One possible roadmap might be:

  • Augment Tcl_Obj with another pointer, to the head of the tree that holds its annotated string representation.
  • Modify Tcl_GetStringFromObj, Tcl_SetStringObj, and so on to use this tree in preference to the string stored in bytes. Note that Tcl_GetStringFromObj will still need to maintain bytes so that it can be freed; callers to this function expect the string to be stored in Tcl's internal memory.
  • Modify string operations like [string] and [append] to work on the tree in preference to bytes. This ought to buy a fair amount of performance on large strings because it will save memory copying.

And perhaps even:

  • Modify the source to annotate each line of the source text with file and line number. This technique is much more powerful than the one suggested in TIP #86 [L3 ]; it allows for correct line number tracing even in the presence of constructs such as [eval], [uplevel], [namespace code] and [proc]s that define other [proc]s.

The existing API to strings, both in Tcl and in C, is surprisingly sparse. Having a third representation of "tagged strings" looks as if it might be a feasible project for Tcl 9. (It should be considered, though, together with the issues that I plan to present in Tcl9 and abstract interfaces -- which I don't have time to write at this moment. Stay tuned.)


RS: Sounds interesting. Once annotated strings make it robustly into the Tcl core, they might also replace the text-specific Btree... And: do you think it's possible to expose the annotation to script level?


FPX: This is interesting. In my understanding, this feature is about objects that are more than their string representation, e.g. opaque data structures at the C level that you want to represent by a pointer value. One important keyword in this context is Feather, which would do the same with code -- its lambdas have information (namely, their implementation) that is not expressed by their stringified representation.


AK: A similar thing I saw somewhere bandied around is the notation of a concat-string-object. Essentially an object maintaining a list of objects object (like a list), but the string is calculated by direct concatenation and not by [join]. One interesting application would be when it comes to variable interpolation in an eval'd string, for example [eval "$cmd foo bar"] because it should allow us to extend the existing optimization of eval for pure lists to more general strings as well, without losing the internal representation of "cmd".


male - 2007-05-16: This sounds really interesting, but ... when I think on the application I work on, an old application coming from the tcl 7.4 times with many constructs really working with the "everything is a string" concept, than I'm a bit afraid, that all this mostly pure string processing would increase the memory consumption a lot.

Why I'm afraid? The C(++) API to tcl is old and completely based on the old Tcl_CreateCommand interface. So every communication with C++ sided functionality will break the Tcl_Obj concept and will work on strings.

Is it possible to estimate the "new" overhead, if a plain character based string would be changed in a B-tree or annotated string representation? I mean the runtime overhead in processing and the memory overhead!

I would be cautious to ask for an enhancement of the string handling, because tcl is currently very quick and does not consume too much memory working with a lot of strings. But what will be the consequence of introducing a new much more complex string representation?

Best regards,

Martin


FB: It seems that ropes perfectly fit the above needs. Tcl strings represented as ropes would be B-trees whose leaves point to portions of immutable strings. Not only would potentially expensive operations be cheaper, but one could associate metadata with the source strings. For example, a source command would load a string into memory with information about the file associated as metadata, and the ropes pointing to this source string (lists, procs) would access this metadata to generate meaningful messages. One could also mix different representations of strings in one single rope, for example portions in UTF-8 or 16-bit Unicode byte arrays, eliminating shimmering (byte arrays are internal reps). Utility procs would simplify iteration and random access. See Cloverfield, section Data structures.