linked lists

[RMS faults Tcl for its linked-lists-lessness. Explain what linked lists are, who RMS is, and Tcl perspective--mostly that Tcl doesn't need 'em.]

fut: Thanks to copy-on-write, linked lists can be implemented trivially in Tcl. RMS's point, I think, is that Tcl lacks references. See cons.

Lars H: Well, he also considered a version of Tcl way older than Tcl 8.0, which is where we got Tcl_Objs and sharing of values; algorithmic complexities were a bit different back then, and that could also be a reason to reject Tcl's list. But yes, RMS is such a 1980's LISPer that he probably wouldn't acknowledge anything not built on references as "lists".''

Arjen Markus: I do not know who RMS is, but "woe, ye of little imagination," Tcl lists can be used as linked lists - if you regard a linked list as some abstract data structure (apart from the implementation!). I regard his/her statement as an example that most people can not distinguish implementation and concept!

AMG: Don't mean to interrupt, but this reminds me of [1 ].

RS: RMS is Richard M Stallman, a famous programmer in the GNU scene, but not enthusiastic about Tcl. (See Who says Tcl sucks...) I also have problems understanding the need for linked lists - Tcl's are contiguous in memory, so the benefit of links, i.e. iterating over it, is easily reached with foreach. OTOH, recursing over lists involves copying in Tcl, so real linked lists like LISP's CAR/CDR also have their advantages.

Mike Tuxford notes that no one mentioned that RMS is also the author of emacs, besides being in the news a lot defending the OSS model.

CL mentions that, true as this is, the 'O' in OSS is anathema to RMS, last time I checked.

LV: I'd say the issue is likely this - people used to algorithmic solutions requiring specific data structures, such as structures, linked lists, etc. find it difficult to envision how to efficiently implement such solutions in Tcl with only flat lists and hash tables.

davidw: the real problem is not linked lists in and of themselves, but the lack of references. Linked lists aren't really all that necessary in Tcl for the simple case, but references allow you to easily construct more complex data structures.

Arjen Markus: Can this not be achieved in a very similar way as in Scheme (the LISP dialect I am gradually learning because of the SICP book)? For instance:

set complex_data_struct {A {B C {D E {F G}}}

Such a structure (lists within lists) allows one to define recursive lists as well:

set recursive_list {A {B {C {D {E {F {G}}}}}}

davidw: maybe it's visually similar, but what's going on underneath is not the same. Have a look at this scheme code, and try and accomplish the same thing in Tcl:

> (define a (list 1 2))
> (define b (list 3 4))
> (define l1 (list a b))
> l1
((1 2) (3 4))
> (set-car! b 30)
> l1
((1 2) (30 4))

As far as I know, it's not possible. Everything gets turned into a string at some point, so references don't get preserved.

PYK 2014-03-1: Right, you wouldn't do that in Tcl, you would do this:

set a 0 
set b 1 
set l1 {{1 2} {3 4}} 
lset l1 $b end 30
puts $l1

Granted, lset wasn't around when this conversation first happend.

Stephen Trier: Back in the dark ages, I wrote LISP-style linked list routines in BASIC that used arrays. The internal pointers were array indexes. One could do the same in Tcl with lset or an array.


Arjen Markus: It was a bit of a challenge, but here it is - with Tcl lists!

#
# Simple data references
#
set ::referenced_data {}

proc dataref { data } {
    set idx [lsearch $::referenced_data $data ]
    if { $idx < 0 } {
        set idx [llength $::referenced_data]
        lappend ::referenced_data $data
    }
    return $idx
}

proc getdata { ref } {
    lindex $::referenced_data $ref
}

proc replacedata { ref new } {
    set ::referenced_data [lreplace $::referenced_data $ref $ref $new]
}

proc reflist { a b } {
    dataref [list "REF" $a $b]
}

proc set-car! { aref b } {
    set adata   [getdata $aref]
    set refcdr  [lindex $adata 2]
    set newdata [list "REF" $b $refcdr]
    replacedata $aref $newdata
}

proc print-reflist { aref } {
    set print_data {}
    set adata [getdata $aref]
    if { [lindex $adata 0] == "REF" } {
        foreach {dummy car cdr} $adata {
            append print_data "( " [print-reflist $car] " " \
                [print-reflist $cdr] " )"
        }
    } else {
        append print_data [getdata $aref]
    }
    return $print_data
}

set ref1 [dataref "A"]
set ref2 [dataref "B"]
set ref3 [dataref "C"]
set ref4 [dataref "D"]

set alist [reflist $ref1  $ref2 ]
set blist [reflist $ref3  $ref4 ]
set clist [reflist $alist $blist]

puts $alist
puts $blist
puts $clist
puts [print-reflist $alist]
puts [print-reflist $blist]
puts [print-reflist $clist]

set-car! $alist $ref3
puts $clist
puts [print-reflist $clist]

Note that the whole trick is that the data in the list are either tagged and hence identifiable as a reference (or better a pair of references) or not, in which case they are ordinary data. References are regarded as ordinary data and stored in the global list that holds all information.

It can undoubtedly be done more elegantly, but that is not the point: I have proven (I hope) the concept of referenced data in pure Tcl.

davidw: Now add garbage collection to it.

SS: see TCL references in TCL for a pure TCL implementation of references and garbage collection.

Arjen Markus: You are not satisfied quickly, are you? :-) Well, the whole model must be elaborated a bit, as now it will work only for examples structured in the same way as the original - pairs of pairs of elements and the like. But all references are stored as data, so when a reference variable gets set or unset, you may register this event and do a bit of cleaning up.

davidw: No, because I know I'm right;-) Basically what you are doing, in a clever way, is building your own memory system on top of everything else, which is neat, but the point of a higher-level language is that memory management gets done for you. You shouldn't have to build a whole new system. I think that feather will resolve some of these problems.

Arjen Markus: Yes, I realise that it would be a lot of work, I had thought of a quick implementation like:

proc GC { } {
    if { [day_of_week] != "wednesday" } {
        puts "No garbage collected today!"
    } else {
        puts "We are out on an extended lunchbreak"
    }
}

but that is just a feeble attempt to a joke. Still, if feather might solve that, why not try and revive that? I think I heard from Paul Duffin that it never came past a bunch of ideas.

Salvatore Sanfilippo: I like TCL for many reasons, but I must admit that the fact I can't use references to create complex data structures like a graph is for me the biggest TCL problem... Of course it is already possible to create non-trivial data structures with TCL, but the lacks of references makes this structures not good for almost all the tasks where complex data structures are really useful. Most real-world problems that fits well in TCL does not require references, also the fact that TCL implements dictionaries, that are a so useful data structure that many problems can be solved using a dictionary without a too big time complexity, makes the lacks of references not so evident. But it exists... and can do the difference between to solve a problem in O(n), or O(n^2).

Some option to implement this: (warning, bad english, maybe NOT well exposed concepts).

1) opaque implementation using structures build of slots and the dual-representation of objects.

I feel strange there isn't a major effort to stop this problem. From what I can see, one of the possible ways to fix this is to use the (not so) new feature of the TCL core to have a dual representation of objects, this way references can be valid until the interpreter is sure they are valid references, so any attempt to convert it to strings will make they just strings (that will no longer work with the command able to get an object from a reference). The references can be implemented so that they will work with a new kind of data structure that is able to contain slots with name. Every slot can contain a reference, or any other kind of object. This structures build of slot with names containing references to other structures are enough to build all the kind of complex data structures, but at the same time they don't mess with all the rest of TCL. For sure a solution where references are 'free' like any other object, able to be converted to strings and then back to references, seems much more 'interesting' and better designed, but the way TCL works makes me quite sure that it's better to make this subsystem something like a new layer of abstraction that is able to live with the rest, rather then a new fully-featured data type.

DKF: This is indeed an obvious way to do it. This method splits into two subkinds: named and unnamed handles.

With a named handle, you have a global/per-interp hashtable that maps the string representation of the object to the pointer representation, and deletion must always be explicit (there's typically also an introspective mechanism for listing the contents of the hash). Tcl's channels work exactly this way, as does itcl. If anything is done with the handle objects themselves, it is just a cache of what could be simply discovered. As a technique, it tends to leak resources without careful (paranoid?) programming.

With an unnamed handle, deletion of the underlying resource is closely coupled to deletion of the internal representation of the handle object. This cleans things up automatically, but is a bit inclined to be fragile. It's quite possible to deploy real applications that use this though: I've done that.

DKF-end

SS: My idea was a bit different, but I exposed it not very well... I'll retry, both the proposed implementations are assuming garbage collection, but the first makes references a quality of a new data structure, that's a Self-like object composed of many slots with names, where every slot can only contain a TCL object, or a reference to another of this composed objects. That's in some way to isolate the reference system from the rest of TCL, with commands able to handle references and this new structures. For example the 'struct' command may create a new object of this kind, starting from a TCL list containing the names of the slots:

set foo [structure {x y z}]

Other commands like set-slot, get-slot and so will assign normal TCL objects (in theory just strings) to slots, while set-ref-slot, and get-ref-slot will be able to set/get objects by reference. BTW, after some day of speculations I tend to think this way to implement references in TCL is simply not good (in favor of the second 'direct' way to do it).

SS-end

2) the 'direct' way.

Another trivial idea, that is probably the first that comes in mind thinking at the subject "tcl and references" is to directly expose the C object pointer in a TCL string, something like REF:0xff112233. The classical problem with this implementation is that a buggy program can mess with the string, so that the TCL interpreter will reference a random piece of memory. Can't be this fixed using an hash table with all the valid pointers? Of course this hash table should be in sync with the garbage collector, that will remove the addresses of no-longer referenced objects for the table. Apart from the design point of view (it is not so fun to have pointers directly exposed), what's the problem with this very simple implementation?.

DKF: If you are maintaining such a table, you're just doing some variation on the above two methods: in general, you're free to choose any handle format you want, but letter-prefix and unique-id-suffix is common. If we consider Tcl's channels again, the numeric part of such handles is actually derived in this sort of way (on Unix, it is the file descriptor, and I don't understand the other platforms well enough.)

Or am I getting the wrong end of the stick?

DKF-end

SS: Yep... but there is an hard problem: attackers. Using this design, the garbage collector have to scan every single Tcl string searching for encoded pointers. An attacker may try to guess valid addresses, in an encoding like PTR:0xaabbccdd that's not hard, so something a bit more robust is needed, like PTR:0xaabbccdd:<64bit-random-number-generated-at-startup>. This ensure the attacker will not pass fake pointers using files, sockets, stdin, and alike.

The parallel with TCL channels is very interesting, while of course channels are not garbage collected, that opens a lot of new issues. BTW, I changed my mind, and now I think the direct exposure of pointers in TCL can be a good idea while it is almost always a bad idea in other languages, for the fact that in TCL all is string, and the language is _very_ orthogonal, so simply it is hard to think that the inability to put pointers inside a TCL list is good design. That's why the opaque design no longer get me... it is even more complex to understand because of the non-orthogonality.

To come back to channels's parallel, like the kernel doesn't expose its file handler in user space, but just maps an integer to its internal stuff, TCL may do the same, so instead to have PTR:0xaabbccdd we may have PTR:0, PTR:1 and so on. This way the table isn't just there to tell us if an address is valid or not, but can be used to translate some sort of object id into its address.

SS-end

These are just random ideas of implementation, for sure there are better, but my point is that Tcl is so good that the Tcl's community should try to find a solution about this problem, to make it even better. It is something not related to slowness, but directly related to the inability to use all the algorithms yon can find in a computer science book with Tcl.

Also I'll be glad if someone can point me to the wiki page about the implementation of references in Tcl (if any). If you really read all the trash I wrote this night, thanks, and of course if you'll create some kind of group to discuss this specific problem I'll be glad to take part, for what I can help.


RHS wonders, would it be possible to define a type of weak reference at the C level that can be used at the Tcl level. For example, one could say

set myList {a b {c d} e f {g {h i j} k} l}
reference mySubList myList [lindex [lindex $myList 5] 0]

This would create a "weak reference" to the {h i j} element of the myList variable named mySubList. When someone sets mySubList, it changes the value in both variables at once.

set mySubList {z y x}
set mySubList ;# -> {z y x}
set myList    ;# -> {a b {c d} e f {g {z y x} k} l}

When the reference count for the head object, or any sub object above the sublist, is set to 0, then mySubList becomes a strong reference (a normal tcl variable).

info variableType mySubList ;# -> weak (a fake introspection command, so I can show what I mean)
set mySubList {a new list}
info variableType mySubList ;# -> strong

When garbage collecting, it would mean a list object would need to do a little more work:

  1. If the object is not a list, clean it up as normal
  2. If the object is a list, check to see if there are any weak references to the object at the top level (the level we're looking at). If there is, make one of them a strong reference, chance all the other weak references to point to the new strong reference, and clean up the original as normal.
  3. For each child in the list, repeat the process and perform the same checks/changes

I'm not sure this makes all that much sense, and I'm positive its not a complete description of how one would go about implementing the idea. However, it occurred to me while reading the page, and I thought it would be a good idea to at least put down in writing what my thoughts were.

Lars H: This violates the immutability of values. Right now you can do

set myList {a b {c d} e f {g {h i j} k} l}
set safeCopyOfMyList $myList

and the value in safeCopyOfMyList will stay the same no matter what you do with myList. Not so with your suggestion. Since the Tcl_Obj is shared between the two variables, a reference mySubList myList lindex [lindex $myList 5 0]

set mySubList {z y x}

would change also $safeCopyOfMyList. The consequences if a package (e.g. http) would start to make use of this technique could be disastrous, since any value handed over to that package might start mutating behind your back.

The postulate that values are immutable is heavily relied upon in Tcl. See 32-bit integer overflow for some of the chaos that can occur when this is violated (in that case because of a bug).

RHS: I don't know that that's necessarily the case.

set myList {a b {c d} e f {g {h i j} k} l}
set safeCopyOfMyList $myList

The refcount of the particular object is now 2

reference mySubList myList [lindex [lindex $myList 5] 0]

Ok, now mySubList points to the object that is a sub-element of myList, with safeCopyOfMyList also pointing at the bigger list

set mySubList {z y x}

As per normal copy-on-write rules, safeCopyOfMyList now has its own copy of the list. Both copies of the list have a refcount of 1, and the subelement pointed to by mySubList has a weakrefcount of 1 (all other objects involved have a weakrefcount of 0).

Lars H: The big list has a refcount of 2, but the {g {h i j} k} list only has a refcount of 1, and the same holds for the mutable {h i j} element. When modifying mySubList, set has no way of knowing that the weakrefcounted object is part of a shared object. The only place you can unshare the two is at the reference command, but I don't think you can get it to make sense. The entire suggestion is based on "weak" references violating copy-on-write rules.

It totally breaks down when you do

set mySubList $myList
string length $myList

Since $myList then contains itself, conversion to string will cause infinite recursion.

RHS: [snip a bunch of stuff I had written, because it came out wrong] What I envision is the following:

% set myList {a b {c d} {e {f g} h} i}
% reference mySubList myList [lindex $myList 3 1]
% set mySubList
f g
% set myList
{a b {c d} {e {f g} h} i}
% set mySubList $myList
% set mySubList
{a b {c d} {e {f g} h} i}
% set myList
{a b {c d} {e {a b {c d} {e {f g} h} i} h} i}

I'm not quite sure what the rules would be on getting it to work like this, but that's what I had in mind.

The basic idea for it all is to allow someone to create data structures using lists, and be able to operate on various nodes in the data structure. For example:

% set myTree { {a 1} {b 2} {d {{e 3} {f 4}}} }
% reference myLeaf myTree [lindex $myTree 2 0]
% proc changeMe {&node value} {
      upvar ${&node} node
      set node [list [lindex $node 0] $value]
  }
% changeMe myLeaf 8
% set myTree
{ {a 1} {b 2} {d {{e 8} {f 4}}} }

All much in the same way one used to be able to upvar to an array element, and then use that as an array, to create more complex data structures.

As an added note, I don't agree that creating an infinite string necessarily violates the rule that everything is a string. Its still a string, it just happens to be infinite in length. You can still act on that string (or list) and do things with it, you just can't print it out.

Lars H: My angle there was more practical -- any attempt to get the string (in the sense of the Tcl_Obj API) will fail due to lack of memory, hence it won't work. But if you want to take a philosophical point of view then you should observe that infinite strings come in many varieties, and that the most common type (string with beginning but no end) is not the one you need to encode lists containing themselves.


MSW: I don't think we have to yell "Tcl has it, too" for everything someone demands. We have Tcl, it works & solves the problems we solve with it. Voila! It's nice that you can do a lot in Tcl to extend it, but most of the time the constructs are just plain ugly. It's nicer that you can easily extend it in C to add what's lacking, and so we do all the time. If RMS wanted ref's, should've added them to the C-core,

package require references

and he'd be happy. But noooo.... :)

RHS: My only issue is that Tcl used to have the ability to create more complex data structures in a natural way... via arrays as elements of arrays. When 8.0 came out, and Tcl_Objs were introduced, this ability was lost, along with the simple power that came with them. It would be nice if we could find a good way to reintroduce that ability. AMG: Try dict.

Lars H: RHS's claim is simply wrong. The only way you could stick an array into an array in Tcl 7 was to put the array get of one array into an entry of another, and you can still do that in Tcl 8; I've done it a lot in both Tcl versions. The main difference with Tcl_Objs is that list elements can now be accessed in constant time; the inability of older Tcls to do that was the basis of most claims that it was impossible to build complex data structures in Tcl.

RHS: Lars, did you think to actually test it? I'm not making this up, I'm speaking from experience.

(2)>> /usr/local/src/tcl7.6/unix/tclsh 
% proc setValues {var keyValues} {
    upvar $var arr
    foreach {key value} $keyValues {
        set arr($key) $value
    }
}
% proc getValues {var} {
    upvar $var arr
    foreach {key value} [array get arr] {
        lappend retlist $key $value
    }
    set retlist
}
% setValues myArray {a 1 b 2 c 3}
% getValues myArray     
a 1 b 2 c 3
% setValues myArray(D) {X 4 Y 5 Z 6}
% getValues myArray
D  a 1 b 2 c 3
% getValues myArray(D)
X 4 Y 5 Z 6
% info patch
7.6

Lars H: I stand corrected, but feel excused by the fact that you claimed it could be done in a natural way -- values in global variables that cannot be accessed directly hardly qualifies, so I certainly hadn't tried that.

Indeed it is not at all clear (from the above) that this is not a bug in upvar rather than a feature. For example, I rather get

% set s ""; foreach ch [split [getValues myArray] {}] {scan $ch %c code; append s [format \\x%02X $code]}; set s
\x44\x20\x7B\x0A\x62\x0F\x04\x0A\x62\x0F\x40\x0A\x5B\x4F\x80\x0A\x62\x0F\xF0\x7D
or, with strange characters replaced by *:
D {
b**
b*@
[O*
b**}

which is rather unintelligible, but at least is consistent with the program (the return value you report seems to be a list with an odd number of elements, which is inconsistent with the getValues procedure you claim produced it)), which doesn't look right at all. I did however use what is rather a Tcl 7.5 than 7.6, so there might have been fixes to it in between.

That aside, this discussion should probably be moved to the array page, as it has nothing to do with linked lists.

RHS: Fair enough. I was under the impression it worked from 7.4 to 7.6, actually. Its been a long time since I worked on 7.6 and even longer since anything before that. You're right about the fact that it seems like a bug. I've been told many times that it is such, I just happen to like it anyways :)

As for the natural part, I guess its a matter of personal preference. I like being able to hand part of a recursive data structure down to a child proc and have it treat it like a whole data structure... like with linked lists, nested arrays, tree structures, etc. That's what I find to be "natural" about it.

I've since reimplemented "part" of the functionality I liked in pure tcl (using traces) as a "package" called Shadow Variables. Its not fleshed out, but it works. You're welcome to take a look at it at http://robert.rkseeger.net/shadow.tar.gz if you're interested. Its completely undocumented, but the test cases with it should give a good idea of what it does and how it works.

ZB 2010-07-08: If I'm not wrong, a simple linked list can be made quite trivial way:

% set a {b 1}
b 1
% set b {c 2}
c 2
% set c {a 3}
a 3

So we've got 3-elements connected into "circular linked list" - a "ring". The first component of each element is the "pointer" to the next one, and the second component is the actual data (here: just simple integer). How to access every single element?

Initialization first:

set x $a
b 1

...and then each time using still the same command:

% set x [set [lindex $x 0]]
c 2
% set x [set [lindex $x 0]]
a 3
% set x [set [lindex $x 0]]
b 1

Similar way we can create "bi-directional ring".

See Also

Tcl and LISP
Deferred evaluation
uses traces instead of specific memory management. RS wonders in what aspects willset is weaker than the above constructions...