What If: Everything is a Thing

RHS 08April2005

I'll start this out by saying that many of you will read this and immediately think "But, then it's not Tcl!". This is a perfectly valid statement. However, what I'm aiming at here is not to propose changes to Tcl, but to discuss what the benefits and problems would be if... Everything is a Thing .

Moving on to what it is I mean by "Everything is a Thing"... I originally meant to title this writeup "Everything is an Object". However, after some discussions in the chat room, I realized the word Object carries too much baggage with it. What is meant by the title phrase is just that Not everything is a string.

== Issues with Everything is a String ==

Currently in Tcl, it is stated that everything is a string. This leads to a number of complications with the language, both in its use and in its implementation.

Command/Thing Lookup

When creating new things that should be anonymous commands, such as lambdas and objects, there is a need to either pollute the proc space, or to perform various complicated workarounds. For things that require internal data, a handle is generally used that is tied to a proc of the same name. While this is an adequate approach, there are various conceptual reasons for the desire not to create a command name for every such thing. One of these reasons is the possibility of name clashes.

When trying to avoid polluting the proc space, it is mandatory to have the string representation contain all the information about the thing. This approach, combined with various techniques (such as leading word expansion, etc), leads to true anonymous commands. However, even for places where all the information about a thing is able to be placed in the string rep, the approach fails when that data is mutable.
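
As a minimal sketch of the string-rep approach described above, assuming Tcl 8.5 or later for [apply] and {*} (the variable names are only illustrative): the lambda lives entirely in its string representation and is invoked through leading word expansion. The second half shows where the approach stops working, since state carried inside the value cannot be updated by calling it.

```tcl
# A lambda held entirely in its string representation.
set add {apply {{a b} {expr {$a + $b}}}}

# Leading word expansion turns the value into a command invocation.
set sum [{*}$add 1 2]
puts $sum                      ;# 3

# The value can be copied freely; no command name is ever created.
set alias $add
set sum2 [{*}$alias 10 20]
puts $sum2                     ;# 30

# A "counter" whose state (the trailing 0) is part of the value.
# Each call increments a local copy, so the state never advances.
set counter {apply {{n} {incr n}} 0}
set first  [{*}$counter]
set second [{*}$counter]
puts "$first $second"          ;# 1 1
```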

Arrays are another example of having to work around the string rep limitation. Consider that arrays are a hashmap to a group of anonymous variables. The fact that there is no string representation for variables means that we cannot pass arrays as arguments. Instead, we need to pass in the name of the array and upvar into it.
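
The name-plus-upvar workaround looks roughly like this in current Tcl (incrAll is a made-up proc name used only for illustration):

```tcl
# The proc receives the *name* of the array, not the array itself.
proc incrAll {arrName} {
    upvar 1 $arrName arr       ;# link local "arr" to the caller's array
    foreach key [array names arr] {
        incr arr($key)
    }
}

array set tmpArr {a 1 b 2 c 3}
incrAll tmpArr                 ;# note: no dollar sign, we pass the name

foreach key [lsort [array names tmpArr]] {
    puts "$key = $tmpArr($key)"
}
```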

Automated Resource/Garbage Collection

The second real problem is that of automated resource and/or garbage collection. This problem applies only to things that require internal data and/or are tied to a named command. Tcl's automated resource/garbage collection is based on the idea that a value is available to be cleaned up once there are no variables pointing to it (i.e., its RefCount is 0). For things that are not referred to via variables (such as commands), there is no way to clean up after them, since we never know when the code is done with them. The fact that they are not anonymous means that they can be accessed later, even if there is no variable containing a reference to them.
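
A small sketch of the problem, using a hypothetical helper (makeLambda and the ::lambdaN naming scheme are assumptions made for this example, not an existing API). The command outlives every variable that ever held its name, because the name can always be rebuilt from a plain string:

```tcl
# Hypothetical helper: creates a uniquely named proc and returns its name.
set lambdaCounter 0
proc makeLambda {params body} {
    set name ::lambda[incr ::lambdaCounter]
    proc $name $params $body
    return $name
}

set double [makeLambda {x} {expr {$x * 2}}]
set result [$double 21]
puts $result                          ;# 42

# Drop the only variable that held the handle...
unset double

# ...yet the command is still there and still callable by name.
set survivors [info commands ::lambda1]
set later [::lambda1 5]
puts "$survivors -> $later"

# Cleanup has to be explicit.
rename ::lambda1 ""
```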

== The Two Types of Things ==

I propose that there are two types of things. The first type are those such that each one is-a value. The second type are those such that each one has-a value.

Is-a Value Things

Things that satisfy the condition that each one is-a value are those that currently exist as first class things; lists, strings, dicts, integers, etc. Each one is such that the value is the thing. There is nothing about them that isn't naturally stored in the string representation, and each one is immutable.

By immutable, we mean that the thing itself cannot change. Instead, to alter the value we create a new thing that has the new value and point the variable at the new thing. The underlying code can make optimizations to avoid the cost of copying an object but, even then, they need to make a copy when multiple variables point to the same object.
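
This behaviour can be observed in current Tcl with a short sketch:

```tcl
set a {1 2 3}
set b $a          ;# both variables now refer to the same value
lappend b 4       ;# b is pointed at a new value; the one a refers to is untouched

puts $a           ;# 1 2 3
puts $b           ;# 1 2 3 4
```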

These is-a things would be handled in exactly the same way as they are now.

Has-a Value Things

These things are those that either do not meet the requirement that the natural string representation holds all the information about the object, or that need to be mutable: arrays, lambdas, objects, file handles, etc. It is with these that we will make changes in how they are handled.

For these things, we will state that they cannot be converted to other types and retain their identity. This is similar to the way arrays work now. While we can ask an array for its list representation, that list is not the array; it does not contain all the information the array needs to be what it is. The difference is that we would be able to refer to the array using code such as:

 # Setup the array
 array set tmpArr {a 1 b 2 c 3}
 # Pass the array into a command, and it is free to modify it
 sortArray $tmpArr

When we ask for a different representation of the thing, what we get is not the original thing. Instead, we get some other thing that contains all the information about our original thing that is natural for it to contain, but is not our original. For example:

 array set tmpArr {a 1 b 2 c 3}
 llength $tmpArr

In the above, the llength command asks for the list representation of the array. Once it has this, it calculates the length. However, the thing it calculates on is not the array itself, but a list with the keys and values that the array contains.
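
In current Tcl, where $tmpArr cannot be read directly, the closest equivalent is [array get]. A sketch showing that the resulting list is a separate thing from the array:

```tcl
array set tmpArr {a 1 b 2 c 3}

# Ask the array for its list representation (keys and values).
set snapshot [array get tmpArr]
set before [llength $snapshot]
puts $before                   ;# 6

# Changing the array afterwards does not affect the list we obtained.
set tmpArr(d) 4
set after [llength $snapshot]
puts $after                    ;# 6
puts [array size tmpArr]       ;# 4
```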

== The Results of the Change ==

The implications of the above constraint are fairly far reaching, and allow us to ignore many of the hoops that we previously had to jump through.

Automated Resource/Garbage Collection

Now that we know only the original (underlying) thing can hold the actual thing, we can apply the RefCount method of garbage collection to automate cleaning up after ourselves. As examples:

  • File Channels can be closed, and their resources cleaned up, once they're no longer being used
  • Objects and lambdas can have their resources cleaned up once there are no longer any references to them (see Command Resolution below)
  • Arrays can be passed as arguments

Command Resolution

Since we now have things that are of a given type only if they are of that type, we no longer need to answer the question of "how did the caller intend this to be used?" Instead, we can say "What type of thing is this?" This provides us with a great boon for command resolution. When we have a command, we can apply the following logic:

  1. Is it a thing that can act as a Dispatcher (object or lambda)? If so, invoke it with the args. Otherwise...
  2. Ask the thing for its string representation and call the command whose name is that string.
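
The two steps can be roughly emulated at the script level. This is only a sketch under a loud assumption: current Tcl cannot ask a value for its type, so the hypothetical proc resolve below treats any list whose first word is "apply" as a dispatcher. That is exactly the kind of string sniffing the proposal wants to make unnecessary.

```tcl
# Emulation only: "is it a dispatcher?" is approximated by inspecting
# the first word of the value. resolve is a made-up name.
proc resolve {thing args} {
    if {[lindex $thing 0] eq "apply"} {
        # Step 1: the thing can dispatch, so invoke it with the args.
        return [{*}$thing {*}$args]
    }
    # Step 2: fall back to the command named by the string rep.
    return [$thing {*}$args]
}

set viaLambda [resolve {apply {{x} {expr {$x * 2}}}} 21]
set viaName   [resolve string length hello]
puts "$viaLambda $viaName"     ;# 42 5
```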

== Positives Summary ==

  • Adds the ability to have lambdas with state and garbage collection:
 set lambda [lambda x {n 1} { incr n $x }]
 $lambda 4 ;# -> 5
  • Much like lambdas, adds the ability to have objects with state and GC
  • Arrays can be passed as values

== Negatives Summary ==

  • Some places, where shortcuts could be used before, would need to actually use [list]:
 set fd [open ...]
 bind .foo <Button-1> "do_stuff $fd"        ;# We know fd doesn't contain spaces, so we cheat here
 # bind .foo <Button-1> [list do_stuff $fd] ;# Would need to be done this way now
 unset fd
  • This would likely cause a very significant change to the C-level Tcl API

== Other Notes ==

  • Circular references are still an issue, as long as RefCounting is used.
  • This discussion isn't in preparation of a TIP to propose the changes to Tcl... it's just an informal discussion of what the repercussions of such a change would be.

Discussion

DKF: What would be the negative consequences of altering the basic model of values this way? (It is late, and I cannot quite spot them at the moment myself.)

RHS: The only negative I was able to think of, from the programmer's point of view, is the argument "In Tcl, Everything Is A String. If you change that, it's not Tcl anymore". That is not to say, however, that there wouldn't be other issues... just that I wasn't able to come up with them.

RHS: Per discussion on the chat, I thought I'd add that another negative is the likely huge change of the C-level API for Tcl that this change would involve. While the cases where it would make a difference at the Tcl level are rare, almost everything at the C level would need to be aware of opaque things.


SYStems I am probably speaking too early, but it is also late and I didn't want to forget this opinion.

First, we need to find out the name of the topic we are discussing, second, we need to reference and summarize the literature of that topic, I think you are breaking your head re-inventing the wheel, find out this topic name (is it type theory, object orientation, or what exactly) second, buy and read the best books written on it! This is the obvious advice

Another thing is on the two types of things. Let's assume this scenario: Ahmed is married to Gilane, both are humans (same type), so let's say you want to call Ahmed's wife. You have two means: you can call Gilane, because you know Gilane is Ahmed's wife, or you can simply call for Madame Ahmed. Three years from now, Ahmed and Gilane get a divorce, and he marries Mona. Calling for Gilane as Madame Ahmed (or Misses Ahmed, please correct my english) doesn't work. But maybe you are not interested in Gilane, you just want to call whoever Ahmed is married to. Ahmed can be used as a reference to his wife, but Ahmed also has a life of his own; actually, even if Ahmed dies, Gilane can still live.

Plus there is more to Ahmed than just a wife. Ahmed also has a car and a house, all of which can be referenced through Ahmed, all of which can live after Ahmed dies (a related research topic might be Object life cycle)

So even if we never talk about Ahmed anymore, we might still want to talk about his house, only then we won't refer to it as Ahmed's house. See, when Ahmed and Gilane got a divorce, she got the house; the house is her responsibility now, and we refer to that same house through Gilane (see relational databases, keys, candidate keys, primary keys, etc...)

So as you see, grade in uni "GPA 3.7", is a value, it refers to "GPA 3.7" and also refers to how brave I was in uni.

Therefore, there is no such thing as has-a value versus is-a value; the only immutable value is the most elementary one! In real life, I have no clue what that is; in computers, it's I and O; in Tcl, it's recommended we think of it as the string (don't dig below a string, even if you can). Everything else changes over time as the result of the transactions (see database literature) applied to it.

And A can refer to the ascii code <insert asci code here>

 % scan A %c 
 65

and it can refer to my grade in the database course, a value is in the eye of the beholder!

I wish I made any sense, I know I still need to read a lot, please recommend anything

RHS I'm really not able to follow where you're going with the above. When you have time, would you consider rephrasing and shortening it? Thanks.

SYStems I wanted to point out that everything is a value that is composed of other values, and you go down until you can't anymore; in Tcl this lowest layer is called a string. Let us consider this: {Ahmed Youssef} is a value, and at the same time, it's a key to an entity far more complex than this simple key; {Ahmed Youssef} is a value and has a value, both; it doesn't matter much, thing is. Within a Tcl program, you cannot extract more values from {Ahmed Youssef} except what it is; {Ahmed Youssef} is a value expressed in the most elementary form Tcl handles, static text! a string.

Now let's look at a command; expr. Now, expr is a key to a value (a calculator) and there is more to expr than just 4 chars, and Tcl knows it; to see the real monster behind expr you need to do something like

 info args expr
 info body expr 
 # I know it won't work because expr is a command, but it can't hurt to return the c code!

This way, you will know all Tcl knows about expr; as for {Ahmed Youssef}, there just isn't any more to it! But if you want you can decode {Ahmed Youssef} to mean whatever you want it to mean (you can then decode a string to see its binary representation, and use each bit as a flag to something special)

expr is an encoding of the value, that could have been returned by info

You are better off thinking of expr and the value that expr is, as the same thing! cause they are ...

My favorite approach to creating anonymous functions is the one where people wrapped the proc procedure in a function that returns a unique key! The unique key can be thought of as the string representation of the proc; you will of course need Tcl's info command to help decipher it, though!

RHS There's a difference between something that is a string, and something that you can retrieve enough information about to recreate it. For example:

  • An array is not a string, it's a collection of variables. Each which has a value (which is-a string) plus the other information associated with a variable (traces, etc).
  • A proc is not a string. You can get all the information to recreate the proc, but you can't use $procName to get its "value", as it doesn't have a pure value.

RS 2004-04-09: I think the is-a and has-a distinction is elsewhere called transparent vs. opaque objects. Transparent objects ("pure values") have no garbage collection issues, as Tcl does that already - and up to a certain size, just modifying a copy is pretty sufficient. However, for a list of 100.000 items, changing it in place with lset is much more efficient. But that requires a variable, which is about the smallest "opaque" object we deal with all the time... See also TOOT.

RHS Variables are always mutable though. What I'm talking about is a level lower, where the Tcl_Obj itself, effectively, does or does not represent a pure value (i.e. has or does not have a string rep). The case is the most obvious, the way things stand now, when looking at arrays. If we had values that didn't need to be able to shimmer to other types without losing information, then we could pass arrays into procs as first class objects.

NEM - RHS, have you read the Feather paper?

RHS No, I hadn't read it. I'd looked at the extension, but never noticed there was a paper. I'll have to go read that. From what I've heard, the thoughts behind it are quite impressive (and probably more thought out than what I have written here :)

NEM Yes and no. Some of the ideas are very interesting. Some of them are less so. Your "has-a things" above are similar to Feather's opaque values. There are two types of thing that currently aren't represented as strings: those for which there is no natural string rep (e.g. C-coded commands), and those which involve state (channels, arrays, vars, objects, etc). In both cases, you need to put the "real" structure somewhere else, and then use a name (key) to look it up when needed. Now, from what I can gather, you are proposing that we should be able to store this "real" structure in the internal rep of a Tcl_Obj, so that it can be passed around without having to do name lookups. As this is the only copy of the structure, you need to prevent it going away. So, you mark the Tcl_Obj as opaque (or some such), which prevents the internal rep from being destroyed (in a conversion). Then, any convert-to-type operations must take a copy of the Tcl_Obj. That part seems reasonable. However, you now have two Tcl_Objs which are "the same", in the sense that one is a copy of the other. However, they are not equivalent -- only one of them has the magic extra state. So they will be indistinguishable at the script level, except that they will produce different behaviour in certain situations. This is a violation of an important principle that lies at the heart of value semantics (i.e. what distinguishes values from variables): referential transparency. This only gets worse in the cases where the hidden state is mutable (mutable opaque objects, in Feather terminology).

So, while you could make the change you propose, and with a consistent semantics, it's no longer Tcl, and it's not a change I would advise. Values are representations at the script level. Hidden state is, by definition, not represented at the script level, so it seems a mistake to try to graft on support for hidden state to the representation of values. I'd argue that we should be making more things explicit at the Tcl level, and relying less on hidden state, rather than the other way round.

RHS While I agree that it's better to rely less on hidden state, there are a number of things that just cannot be handled without it. Due to this, there are inconsistencies in the "everything is a string" mandate, and there are many hoops to jump through in other places. My thought is, basically, to discuss what happens if we admit that not everything is a string; what benefits and problems doing so brings with it.

NEM: By "hidden state", I was referring to state which is not represented at all in the value representation -- i.e. not directly, or indirectly via a name. I was arguing that such implicit (that's probably a better term) state is always a bad idea. The current mechanism Tcl uses is to use a name (which is a string), and then use that to lookup the state. The key difference is that you regain the property that two values that look the same at the Tcl level will always refer to the same "thing". The state is still hidden, but at least you know it's there. With your proposal (and in Feather) there are situations where two values are syntactically indistinguishable (at the script level), and yet are semantically distinct.

RHS Yes and no. While it would be impossible to "visually" tell the difference between them, it's feasible to extend the script level comparison methods to be able to compare opaque things. For example, an opaque Tcl_Obj could be required to supply an isEquals command, much the way all Tcl_Objs are required to register a cleanup method now.

NEM: So, I'd have to do something like [someObjType isEquals $a $b]? This doesn't solve the problem though: you are still relying on implicit state to distinguish syntactically identical values. There are many languages which do this, of course. But the fact that Tcl doesn't do this is one of the things I like most about the language.

RHS: Actually, that's not what I meant. What I was getting at is that you would still be able to do:

 if { $myArray == $somethingElse } { ... }

Under the hood, Tcl would look at myArray and say "This is not a pure value, I'll call its comparison method with somethingElse to see if they're equal". I.e., it would do comparisons the way it does them now, unless one (or both) of the values were opaque things. In that case, it would call the registered comparison command for one of them, handing it the other as an argument.

NEM: It doesn't matter whether you introduce a new operator at the script level, or change the behaviour of ==. The result is the same: introducing semantic differences at the script level on operations over syntactically indistinguishable (at the script level) entities.

RHS Why is it that you'd consider them indistinguishable? If the == operator says they're different, then they're different. What, at the script level, is saying they're the same? Consider it much like the handles we use now for file channels, only you're not actually allowed to look at the real string rep.

NEM: They are syntactically indistinguishable. OK, Tcl's syntax is based on strings, so this is a slightly circular argument. However, the difference is that with EIAS, I can examine two values in several ways: I can see if they are equal; if they aren't, I can examine them to see how they differ; I can compose different string reps together (composability is a key requirement of syntax, after all); etc etc. I can't do any of that with implicit state, unless you supply a whole raft of functions for every new type. What this means in practice, of course, is that you actually expose the representation of your implicit state at the script level (i.e., make it explicit). But everything which can be represented, can be represented as a string. So what have you gained? A lot more complexity, for very little gain.

RHS My argument, however, is that this issue already exists in Tcl, for things like arrays. By adopting a specific way in which to handle such cases, we can make it so other types of things which naturally fall into such a category are much simpler to implement and use (objects, lambdas, etc).

NEM We're going round in circles here. Named arrays/vars/channels etc are emphatically not the same as implicit state -- the name is the representation. It's not a perfect representation, and requires some context, but at least it's there. There are some things which cannot be statically represented up-front. This includes things like channels, vars etc, which need contextual information. So, you're right in that this is always needed. You need some key/name in order to recover this contextual information. You are arguing that this key should not be a string, but instead be the pointer to a special, immutable internal rep. The problem with this is that only certain operations know about the internal rep, so you get inconsistent situations, where e.g. copying a value loses its context, but other operations are dependent on the context. All operations know about strings though. You can still lose context with names of course: they can't be reliably transferred across interps/processes/machines in general, and e.g. closing a channel will remove the contextual information that a name refers to, but there are far fewer situations where this happens with EIAS, and (as with GC), the cases where context is lost in EIAS are a proper subset of the cases where context can be lost with opaque Tcl_Objs. The benefit of EIAS is that you have a single representation mechanism, so everything is copied correctly. What I think would be an improvement would be to move towards a single mechanism for contextual information, and that this mechanism should be based on names which are strings, rather than on pointers to things of arbitrary types.

RHS For arrays, the programmer has to take care of "bringing the state with the data". This is handled by using upvar.

Yes, the cases where context is lost in EIAS are a proper subset of the cases where context can be lost with opaque Tcl_Objs... However, the cases where context is leaked in EIAT are a proper subset of the cases where context can be leaked with EIAS. Given the difficulty of automatically cleaning up context with EIAS (so difficult it just isn't done in Tcl), I tend to think it might be worth the tradeoff of having to do things right in certain places.


RS: If you can go without array element traces, the dicts available from 8.5 have all the other array functionality, while being pure values. And I think that "everything has a string rep" is too valuable a feature to throw overboard...
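
A minimal sketch of the dict alternative mentioned here, assuming Tcl 8.5 or later (bump is a made-up proc name): the dict is passed as an ordinary value, with no upvar, and the caller's copy is untouched.

```tcl
# The dict arrives by value; "d" is a local copy as far as the script can tell.
proc bump {d key} {
    dict incr d $key
    return $d
}

set counts  [dict create a 1 b 2]
set updated [bump $counts a]

puts [dict get $counts a]      ;# 1
puts [dict get $updated a]     ;# 2
```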

RHS As I admitted right at the beginning of my "paper", the "In Tcl, Everything Is A String" argument is a powerful one. However, my point was merely to discuss what admitting that that statement isn't true gains/costs us. In what situations does something that cannot be represented as a first class value (file handles, arrays, etc) actually lose something if we make it so they are considered "references" instead... so that they can be passed around, etc?


Peter Newman: RHS, I basically agree with you. And Tcl already can do what you want. The problem is that JO never clearly defined what everything is a string means. And different people interpret that concept/idea in different ways.

In my opinion JO, like Larry Wall with Perl, was looking for ways to make programming simpler - and eliminate the complexity of C. And everything is a string is the part of that that says let's make it so that all parameters passed between functions are passed as strings. That eliminates the need for type casting - which is a major cause of the extra complexity of C, compared with Perl and Tcl.

But it DOESN'T mean that any function, internally, has to process stuff as strings. Nor does it mean that data has to be stored in string format.

That's why in Tcl, we pass (opened) files by reference - using their file handles.

And though lists are currently passed by value, they could also just as easily be passed by reference. (And IMHO, it would be better if they were). You could do this by giving lists list handles - and passing that to the list functions.

Everything is a string simply means that the list handle would be passed as a string (just like file handles are). The list, internally, would be stored in whatever format makes for the fastest/most efficient processing.

It may be that I've mis-understood what you're saying. But as I see it, Tcl already allows everything to be a thing.

RHS I think you've misunderstood what the main thrust of my writeup was getting at. Currently, in saying that Everything Is A String, we mandate that everything (including handles) must be able to be converted to a string, and then back again with no loss of information. For example:

 set fd [open myfile.txt r]
 set notFd [string range $fd 0 end]
 unset fd
 gets $notFd

The problem is that, by mandating this, we run into a couple of problems that can be hard (or impossible) to work around:

  • We cannot clean up the resources/memory of the internal data of the thing, since it is almost impossible to tell if it can still be referenced
  • We cannot ask (at the C or Tcl layers) "Is this thing of the type X?", since it might not be an X at the time we ask.

By saying "Some things just aren't strings", we free ourselves of that limitation:

 set fd [open myfile.txt r]
 set notFd [string range $fd 0 end]
 unset fd
 gets $notFd  ;# --> error

The variable notFd is set to whatever the file channel (fd) considers its string representation. It is not, however, equal to the file channel... nor can it be converted to one.

NEM Leaving aside problems of whether automatic garbage collection of external resources (e.g. file handles) is a good thing, there are clearly two issues here: one is with the management of stateful things; the other is with typing. I agree that management of stateful entities (variables, objects, channels etc) is a weak point in Tcl currently. However, I think the correct way to "solve" this problem (if things really are that bad), is at the level of names/variables. Names can already be passed around quite conveniently, and don't have any problems with losing internal rep. What would be nice would be to have a garbage collection mechanism which allows the registering of names to be managed by the GC (this would be simplified by a unification of naming schemes for things like commands, vars, channels etc, which is in itself a radical proposal). Jim has a notion of references that works in a similar way: the mechanism is heuristic, and there are some edge cases which it might miss (AIUI), but it is at least as good as the opaque object method. The advantage of working with names is as I have stated above: it avoids having to rely on information which is not available at the script level. The second issue, about typing, is connected with what I've just said: asking "Is this thing of type X?" also relies on a notion of type which is not explicitly represented.

RHS These issues already exist in Tcl. Using arrays as an example, an array is different than a pure value. File channels are another. Yes, it's possible to use a handle/name instead of just admitting it isn't a string. However, as long as you pretend it's a string, then resource collection isn't possible.

 set fd [open myfile.txt r]
 set f0 [string index $fd 0]
 set frest [string range $fd 1 end]
 unset fd
 gets $f0$frest

As long as you allow these names/handles to be treated as normal strings, you lose the ability to clean up resources. There are lots of other examples that follow along the same lines. As such, my solution is not to allow them to be treated as strings. If you need to treat one as a string, you just get a string representation for it that can't be used as the original object (e.g. array get)
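
For reference, a runnable sketch of how the split-handle case behaves in today's Tcl, assuming Tcl 8.6 or later for [file tempfile] (the temporary file stands in for myfile.txt). The channel is looked up by name, so a handle rebuilt from pieces works exactly like the original:

```tcl
# Open a scratch file for read/write; path receives the file name.
set fd [file tempfile path]

# Split the handle into pieces and throw the original variable away.
set f0    [string index $fd 0]
set frest [string range $fd 1 end]
unset fd

# The reassembled string is still a perfectly good channel name.
set name $f0$frest
puts $name "hello"
seek $name 0                   ;# flushes the write, then rewinds
set line [gets $name]
puts $line                     ;# hello

# Nothing closed the channel for us; cleanup is manual.
close $name
file delete $path
```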

Peter Newman: The above file open examples confuse me. Resources are (presumably) allocated when the file is opened, and released when the file is closed. And making multiple, redundant copies of the file handle doesn't and shouldn't affect that. I can't see what's wrong with the way that Tcl treats either the files or the variables, in the examples you've given above - or why it results in losing the ability to clean up resources. The programmer can close the file, and/or unset any of the variables, at any time. And when they do so, surely any allocated resources are released?

I must be missing something. What is it that's not getting cleaned up (when it should be)?

NEM (Peter, the idea is that the [close] be done for you, automatically). RHS, resource collection is perfectly possible. The case you've outlined is indeed the case where a string-based GC will likely fail. However, I don't see this as a problem: just document for the GC that splitting file descriptors in half is a lousy thing to do, and will result in possible loss of the reference. Note, of course, that the cases in which string-based GC fails are a proper subset of those in which your scheme fails (i.e. your scheme, AIUI, would also fail for the above case, and more). So, rather than making things better by abandoning EIAS, you are making them worse!

RHS The thing is that what I'm proposing won't "fail" in this case. Once you no longer have a variable that points to the original Tcl_Obj, the resources are cleaned up. In the same way that you want to say "This is bad, don't do it", I want to say "You can't do this". Can you present a concrete example of where my scheme would fail to clean up resources at the expected time?

NEM I wasn't saying that your proposal won't clean up resources at the "expected" time. Just that your scheme will eagerly clean up in some areas where a string-based GC will not. String-based GC will also always succeed in cleaning up. e.g., a case where your example fails:

 set fd [open ...]
 bind .foo <Button-1> "do_stuff $fd"
 unset fd

It used to be the case (not sure if it still is) that binding scripts in Tk were string-based, and so in this example, your $fd would be collected sometime after the unset fd, but before the binding has finished with the reference. Opaque objects lose references more often than strings do, and not all of these are obvious.

RHS Thanks! That's the type of example I was looking for. I wonder, however, if it's possible to have the underlying C layer handle the above case. If we consider that double-quotes group, would it be reasonable to have the C layer consider the above as the string "do_stuff " followed by the value of the variable "fd"? If not, then there would be a difference at the C level in that you'd need to use [list] to create the callback.

As an added note, this is a place where one should really be using [list] anyways. While it's possible to use double quotes here because we know fd doesn't have any spaces in it, we are still taking a shortcut. The correct way to do it would be with [list], which would work with opaque things too.
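
The difference between the shortcut and [list] shows up in current Tcl as soon as the substituted value contains a space. A small sketch (do_stuff is the placeholder command name from the example above and is never actually invoked here):

```tcl
set arg "two words"

# Double quotes flatten the value into the script text.
set quoted "do_stuff $arg"

# [list] keeps the value as a single, properly quoted word.
set listed [list do_stuff $arg]

puts [llength $quoted]         ;# 3
puts [llength $listed]         ;# 2
puts [lindex $listed 1]        ;# two words
```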


RS: I'm not sure I understand why automatic resource cleanup is so important. But maybe I'm thinking too traditionally... I distinguish two kinds of values:

  • persistent, will live over the lifetime of the program - global variables, including in namespaces
  • transient, will live only in a proc context - local variables

And that every open shall be matched by a close is an old rule... Should a file handle remain open due to incomplete error handling, there is always [file channels] to find out.
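
A sketch of that rule in practice, assuming Tcl 8.6 or later for [try] and [file tempfile]: the finally clause guarantees the matching close, and [file channels] confirms whether a handle is still open.

```tcl
set fd [file tempfile path]
try {
    puts $fd "some data"
    # While the channel is open, its name is listed by [file channels].
    set wasOpen [expr {$fd in [file channels]}]
} finally {
    # Runs whether or not the body raised an error.
    close $fd
    file delete $path
}

set stillOpen [expr {$fd in [file channels]}]
puts "$wasOpen $stillOpen"     ;# 1 0
```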

RHS: However, there are other things that have a need to be cleaned up after... objects, lambdas, etc. While these things are not part of the core language now, people have been asking for them for ages. Personally, I consider automated resource cleanup to be something I'd very much like to see for both objects and lambdas.

On the note of file channels, I don't understand the desire to not have them cleaned up. I know they should be closed... but I don't know when it would be a bad idea to have them cleaned up automatically just in case (the same tends to hold true for db connections).

jcw - Perhaps a bit on a tangent, here's an example of the need for resource cleanup: data, not in memory. Say you have more data than is convenient to load and bring into Tcl_Obj's (not to mention startup/shutdown times), and you want to access and modify it. There's two ways to go about it: 1) create a database system outside Tcl, and have it pass info in and out of Tcl, or 2) find a way to refer to that external data (handles, references, pointers, whatever).

Approach #1 works fine, all you need is to keep the entire database "open" (say as a command object), and close/delete it when done. But you end up with results which have no connection to the "real" thing, i.e. copies that live in Tcl_Obj's, with independent lifetimes. There's no way to deal with the results of a large query without creating a complete in-memory copy. Well, there is, but then you end up with approach #2: passing some sort of reference to the large query to Tcl without passing the complete result set. You can then use that reference to find out the number of results, get individual values, etc.

The trouble is cleanup: if the reference is a string, then the database has no way of knowing when it is no longer used (it cannot rely 100% on the dual-rep, since shimmering can lose that). If it is a command, then the Tcl programmer must decide when to delete the command explicitly. That becomes a burden when the reference gets passed around and re-used as basis for other references. Example: a query on the database, returning a selection, followed by further requests to get the subset of values currently showing on the screen, or a certain column, or a re-sorted set.

So while Tcl does fine when it manages all data, it really has a problem with things that it does not, such as datasets which are persistent in a very different sense than what RS used above: persistent beyond the lifetime of a single program invocation. The idea of bringing all data into a Tcl process for use has its limits with larger datasets as well as when only a fraction is needed at any point in time.

Tcl badly needs a way to deal with tracked references, not just named handles. Both NAP and VKIT bend over backwards to solve this issue without changing Tcl, with partial success. In my Vlerq research project, I have a decent solution, but it requires a bit of discipline. Jim also solves it, but not for Tcl itself. It can probably be solved with a single bit in each Tcl_Obj, btw - but that's another story...

NEM Note that Jim's references are in fact named handles, which are GC'ed. I think that is the correct way to go about it: GC at the level of names, rather than values. As noted above, this approach is at least as good as (and in some cases better than) the opaque Tcl_Obj approach (which I assume is what the single bit would do: signal that any internal rep change should be done via a copy). The only difference I'd make from Jim, is that in Jim references are a new naming scheme. I'd just use regular variable names, and supply a function to register them with the GC.

RHS Consider, in Jim, that you have a string that is a handle to a given thing's resources. As long as this string is whole, the resources are not cleaned up. If you break up the string, or do anything to change it, the resources it refers to go away. At that point, why bother allowing it to be a string at all? Why not just say "This is not a string"? Once you do that, you've arrived where I'm aiming to go.

NEM Because it's more robust in more situations to treat it as a string: all commands know about strings, and more importantly, know how to copy strings without losing context.

RHS But, is it robust to say that you can treat it as a string, but then lose data/context/information when you perform string operations on it? There are cases in either scenario where doing things the right way is the only way to get the desired result... and doing things the wrong way will happily blow up (ie, lost context).

NEM We're having at least three separate conversations on this page now, mostly covering the same ground, so I'll just reply here from now on. The key point is that there are more cases which are "the wrong way" with EIAT than there are with EIAS.

RHS I'm not sure I'd agree that there are more "wrong way" cases with EIAT, but I'd agree it's a debatable point. However, there's things that can be done, or done naturally, with EIAT that cannot be with EIAS. Is that tradeoff worth the cost, though?


SYStems Okay, my take on this discussion, my attempt to sum it up:

Tcl allows you to create a value (a string, a list, procs, commands, files, etc.). We should not forget that our scope is a Tcl program (or system, if you prefer).

To manipulate (apply an algorithm to) a value, you either need to pass this value to the algorithm, or pass a key to that value and create a mechanism to read the value through the key.

Some programming languages, which call themselves functional, always and only pass values: everything is a value. Other languages are <invent paradigm name here>, and always pass a key: everything is a pointer.

Tcl is not pure: some algorithms wait for a key, others wait for a value. And not all values have a string representation that Tcl can use (commands written in C, for example).

We can draw many conclusions from this. Any value that doesn't have a key is immutable, because a value without a key cannot be traced and must be self-evident. Any value that has a key (a ref) is automatically mutable.

If a value has a universally unique key, you should never be allowed to delete it implicitly, since you can never know when it will be needed.

A value that doesn't have a key must be deleted automatically if, after it was produced, no other algorithm received it.

The problem with arrays is that they rely on some kind of pointer arithmetic: you fetch the value based on a predictable key. An array gets its nature from a trick played on the key. When you pass an array you don't pass a value or a key, you pass a key pattern!

Another issue: Tcl doesn't create a universally unique key for every value, which means that when programming an algorithm you must be aware of this, and pick a style!

When using keys to refer to a value, various issues arise: who can read it, who can update it? Since it is mutable, the more we think about this, the more every program becomes a DBMS!


NEM Perhaps an example of what is good about having references be strings is needed? (As opposed to being an internal rep). One of the key advantages often cited for Tcl and EIAS is the transparent (de-)serialisation, such that values can be sent over a network easily. With string-based references (i.e. names) you can even do this for context-sensitive information, e.g. channels! For example, suppose I do the following:

 set fd [open $file]
 set sock [socket ...]
 puts $sock $fd
 ...

now, it may seem that the context is lost -- how can the process on the other end of the socket make use of this file handle without the contextual information? Well, it can't directly. However, as it does have something with which it can refer to the context, it can simply make requests back to the other process:

 puts $sock [list gets $fd]
 gets $sock line

The other end just [eval]s incoming requests, and effectively acts as a proxy. This sort of thing is just another example of why having everything be a string (including references) is part of what makes Tcl so powerful. With things, as I understand it, this is not possible. At least, not with such ease.

RHS My counter argument is that, while things like the above can still be done by jumping through hoops, the places where one no longer has to jump through hoops because of EIAT are more useful than cases such as the above. I guess the question is: which one proves more useful in the most common and general cases? In the end, it's a matter of opinion. I tend to think that the positive points of EIAT outweigh the costs... Of course, that doesn't make me right, just opinionated ;)

NEM Can you give me an example (in code) of where EIAT would make things easier?

RHS I think the cases where I'd find EIAT most useful would be:

  • Passing arrays "by value", rather than having to upvar to them (would be faster, and easier)
  • Having actual lambdas with local vars and GC (not possible the way Tcl is now)
  • Having objects that don't have to eat a proc name
  • Automatic resource/gc of other types of data (see the example below)

As for actual code, I'll use database connectivity as an example. I'm speaking from experience using dbs from within AOLServer; that might be relevant, but I don't think so. Currently, when accessing a db, one needs to perform the following:

  • grab a handle
  • use the handle to access the db
  • (perhaps) use the result in some way
  • release the handle

 set dbHandle [ns_db gethandle $pool]
 set code [catch {
     set data [getSomeData $dbHandle]
     return [processData $data]
 } result]
 ns_db releasehandle $dbHandle
 ... a bunch of code to propagate the result ...

If we could rely on the database handle resources being cleaned up automatically, then we could just do:

 set dbHandle [ns_db gethandle $pool]
 set data [getSomeData $dbHandle]
 return [processData $data]

As a sidenote, what I actually do is use a homegrown try/finally construct to do:

 set dbHandle [ns_db gethandle $pool]
 try {
     set data [getSomeData $dbHandle]
     return [processData $data]
 } finally {
     ns_db releasehandle $dbHandle
 }

However, I still think it would be easier if I could just let the db handle clean up after itself.
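
For readers without such a construct, here is a minimal sketch of what a homegrown version might look like in Tcl 8.5. It is not the construct used above: the name tryFinally and its two-argument form are assumptions made for illustration, and a counter stands in for the real handle release so that the sketch runs without AOLServer.

 # Hypothetical helper: run body, always run cleanup, then pass the outcome on
 proc tryFinally {body cleanup} {
     catch {uplevel 1 $body} result opts
     uplevel 1 $cleanup
     # Bump -level so that ok, error and return all propagate to the caller
     dict incr opts -level
     return -options $opts $result
 }

 # Stand-in for releasing a db handle
 set released 0
 proc demo {} {
     tryFinally {
         return "processed"
     } {
         incr ::released
     }
 }
 set value [demo]

The cleanup script runs whether the body returns, raises an error, or completes normally, which is the property the try/finally version relies on.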

NEM Let's take these points one at a time:

  • Passing arrays "by value". I assume you actually mean "by reference", as arrays aren't values. There are two ways that this can be approached: first, use a dict if you really want a value that can be passed around. Second, if you really want to pass a reference to an array around, and find the use of upvar inconvenient, then you can define your own proc construct that hides it for you. There is code on this wiki somewhere to do that, e.g.:
 myproc blah {foo &ref ...} { ... }

The myproc construct takes care of doing the upvar. I don't think speed is an issue (upvar is unlikely a hotspot in your code).
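
A minimal sketch of how such a construct could be written (this is not the code from the wiki; the &name convention and the internal __name parameter names are assumptions for illustration):

 # Hypothetical myproc: arguments written as &name are treated as variable
 # names in the caller and linked into the body with upvar
 proc myproc {name arglist body} {
     set prelude ""
     set newargs {}
     foreach a $arglist {
         if {[string match &* $a]} {
             set v [string range $a 1 end]
             lappend newargs __$v
             append prelude "upvar 1 \$__$v $v\n"
         } else {
             lappend newargs $a
         }
     }
     proc $name $newargs "$prelude$body"
 }

 # The body uses arr as if the array had been passed in directly
 myproc addTo {&arr key n} { incr arr($key) $n }
 array set prices {apple 3 pear 4}
 addTo prices apple 2

After the call, prices(apple) in the caller has been updated, without the body mentioning upvar.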

  • lambdas with local vars etc. Your assertion that proper lambdas with local vars and proper cleanup are not possible in Tcl currently is just plain incorrect. Lambdas are pure values: they don't need any extra state, and so can be represented directly as strings. For example here is a basic lambda that works right now in pure Tcl (8.5):
 package require Tcl 8.5
 proc apply {arglist body args} {
     lassign $args {*}$arglist
     eval $body
 }
 apply {a b} { expr {$a + $b} } 12 15
 lsort -command [list apply {a b} { string compare $b $a }] {b c a d f g j i a n}

If not yet at 8.5, use this instead (RS):

 proc apply {arglist body args} {
     foreach $arglist $args break
     eval $body
 }

It's not very efficient, and it doesn't allow defaults or "args", but it works. Mutable closures would be a different kettle of fish, but plain lambdas are fine. See TIP 194 [L1 ] for a proposal for an efficient implementation of lambda which works within the current Tcl semantics.
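
For comparison, Tcl 8.5 and later ship a built-in apply that takes the lambda as a single list of argument list and body, and supports defaults. A short sketch, assuming the built-in has not been replaced by a proc of the same name:

 # The lambda is one value: {arglist body}; b defaults to 10
 set sum [apply {{a {b 10}} { expr {$a + $b} }} 5]

 # The same value can be used as a command prefix, e.g. for lsort -command
 set sorted [lsort -command {apply {{a b} { string compare $b $a }}} {b c a d}]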

Likewise, objects which "don't have to eat a proc name" can be done in a similar way (see TOOT), depending on what you take an "object" to be. Besides, I don't see what is so bad about objects taking proc names. Seems quite sensible. (It would be nice if we could have proc-local commands, though, but that's a whole different argument).

In your db example, I would probably write a construct something like:

 proc withdb {var pool body} {
     upvar 1 $var v
     set v [ns_db gethandle $pool]
     set rc [catch {uplevel 1 $body} ret]
     catch {ns_db releasehandle $v}
     return -code $rc $ret
 }

 withdb handle $pool {
     processdata [getdata $handle]
 }

and let that take care of the details of allocating and releasing the handle. Tcl provides incredible facilities for defining powerful abstractions to hide all the details you have to handle up-front in other languages. I'd contend that this is really a better way to handle this than relying on GC finalisers.

To round off, I've already mentioned before that it is perfectly possible to layer a GC on top of a string-based reference system.

RHS Going point by point...

Yes, you can use dicts instead of arrays. However, they are a primary example of the hoops that have to be jumped through because some things in Tcl just aren't strings, yet we try to enforce the rule that everything is. Dicts are just arrays without hidden state, and exist primarily because arrays can't be passed as arguments. We have two features in the language that serve almost the exact same purpose, which could be avoided.
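
To make the contrast concrete, here is a small sketch of the two calling styles in Tcl 8.5 (the proc names are made up for illustration):

 # Array: the caller passes the array's name, the proc links to it with upvar
 proc sumArray {arrName} {
     upvar 1 $arrName arr
     set total 0
     foreach key [array names arr] { incr total $arr($key) }
     return $total
 }

 # Dict: the caller passes the value itself
 proc sumDict {d} {
     set total 0
     dict for {key val} $d { incr total $val }
     return $total
 }

 array set counts {x 1 y 2}
 set fromArray [sumArray counts]
 set fromDict  [sumDict [dict create x 1 y 2]]
 # An array can also be flattened to a value first, at the cost of a copy
 set fromCopy  [sumDict [array get counts]]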

As for the speed issue, I do have code that has to pay the cost of upvar'ing. That cost is noticeable, and I would prefer to be able to avoid it. I have had to disable other features because the code does not run fast enough. However, it's worth noting this code does not fall into the realm of "normal Tcl usage".

As for lambdas with state, consider an iterator as I showed previously:

 set lambda [lambda {x} {n 1} { incr n $x }]
 $lambda 4 ;# -> 5
 $lambda 5 ;# -> 10

How does one implement the above, without polluting the proc space, while still having garbage collection? To extend that a bit further, how does one implement classes/objects with resource/garbage collection? Classes and Lambdas, in my thoughts, are very close to the same thing. A lambda is to a proc as a class is to an ensemble; a class is a lambda that can dispatch to different methods, instead of just one.

As for the database example, it's true that, if you jump through the right hoops, you can avoid the code complexity in other places. Admittedly, you pay a speed cost, but I don't think that's a huge deal when dealing with databases (since the database access tends to be a bigger cost either way).

Overall, I'd say mutable lambdas and objects are probably the best example of where opaque things make sense. When you add in the fact that it makes the rules consistent in all places where something isn't a pure value, I think there's something worth considering.

NEM Dicts aren't a "hoop" to jump through; they're a very useful data structure. Arrays can't be passed as values because they are not values. You can pass their name, which is a value. I've argued in the past (and still believe) that if anything, it is arrays that should be jettisoned from the language: they duplicate functionality of namespaces (which are also collections of variables). However, namespaces are too heavyweight to replace arrays, I fear. Ideally, I'd toss out namespaces and arrays in favour of dicts and a new first-class reference (name) type -- that is arguably the best separation of concerns. The details of how I'd do this are quite long and involved though, so I'm not going to go into them here. To give a taster, using the pure-Tcl code at the bottom of this page, here is your example:

 set lambda [lambda {x} { incr $n $x } [dict create n [ref::create 1]]]
 apply $lambda 4; # -> 5
 apply $lambda 5; # -> 10

I can make that "apply" go away with a little more magic. This lambda doesn't pollute the proc space, and is fully garbage-collected, using Tcl's in-built reference counting (it's just a string). The only complication is that you need to implement full GC for the reference type. I have working code to do that. The code on Tcl references in Tcl is almost identical to what I have. You can build classes and objects on top of that easily enough.

I don't see why you think the database example is "jumping through hoops". You have to write some cleanup code somewhere. Even if you have GC, you have to write the correct finaliser functions, so that it knows how to clean up different resources.

RHS The code for your lambdas doesn't have the lambda containing its state. Instead, it has the lambda containing a "reference" to its state, which is stored somewhere else. I tend to find this "wrong" both conceptually and practically; polluting the variable space is not all that different from polluting the proc space.

In addition, the fact that it's stored as a reference makes automated GC a problem. While you can have it, it becomes limited since you need to search through every Tcl_Obj in the interp to see if a ref is still being used. As you have more and more Tcl_Objs (up to hundreds of thousands), this will become slower and slower. The nice thing about RefCounts of Tcl_Objs is that automated GC is fast. You also still have the situation where you're saying something is a string, yet limiting what you're allowed to do with that string (it's a string, but you can't use it like one).

NEM We fundamentally disagree on these points. A lambda is a value, and hence can't be mutated. If you have lambdas which can be mutated, then they must have some identity outside of the function they perform -- i.e. they are named in some way, and thus are no longer anonymous. Whether this name is a string or a pointer or whatever, it still exists. To my way of thinking, a mutable lambda is what is conceptually wrong -- a contradiction in terms. This is an inescapable fact: in order to have mutable state, you have to have indirection. Whether you hide it by hooking some semi-visible mutable state from a pointer that hangs off your value representation, or if you show it up-front with a name, it all boils down to the same thing. So the converse to your objection about "polluting the variable space", is for me to complain about you "polluting the pointer space" -- at least there are an infinite number of possible names: why limit me to 2^32 (or 2^64) lambdas? :)

Yes, a full scan GC over all variables is at least O(n) in the size of used memory. However, this is the case with all full-sweep GCs in other languages. How do you avoid circular references in your scheme (after all, a pointer from the internal rep can refer to another Tcl_Obj, whose internal rep may refer back to it.. etc)? So you will also have to do a full GC in the general case, unless I'm missing something. You still have to scan every structure. The only advantage you have is that a reference is always precisely one Tcl_Obj, so you can avoid some expensive string searches. Note though that a string-based reference approach could make exactly the same assumptions. However, it is these assumptions for the sake of speed which I don't think are worth it, as they expose implementation details at the script level. Jim already has garbage-collected references which are very similar to what I propose, and they work very well.

RHS Indeed, we disagree on the point that a lambda is/isn't a value. I think of a lambda as an anonymous proc and, conversely, a proc is just a lambda with a name mapped to it. I'd like both to have the ability to carry "state" (variable mappings) with them. An object is just an extension of this... similar to a namespace or ensemble. It does the same thing a command does, only it can dispatch to sub-commands.

As for GC, yes you would still need to handle circular references by hand. There are some tricks that could be used at the core level to minimize this (or even, possibly, remove it) while sticking with RefCounts... but I think those are better left to another conversation. Suffice to say that things that have circular references might require a Tcl-level awareness to be handled correctly.

That being said, I still find it to be better than a full-sweep GC mechanism. The full-sweep mechanism requires that nearly every string be searched for a reference, assuming you want to avoid the same problem with callbacks that EIAT has issues with. As far as I know, other systems that perform full-sweep don't have this issue because they deal with real references, not strings. While Jim uses the full-sweep mechanism, I think it breaks under the callback issue, and does not handle circular references.

The cost for a full-sweep is, in my opinion, too large. Scanning through hundreds of thousands of strings every time you want to reclaim memory is more than I'd want to pay.

NEM Your definition of lambda in terms of proc and proc in terms of lambda is a circular reference! :D We do have fundamentally different views of what constitutes a value. Your use of the term "real references, not strings" to me demonstrates that we have fundamental differences of philosophy. This page is already too long, let's just agree to disagree. The next step, then, is for you to implement your ideas and see how they work out in practice.

RHS Indeed, this page is all about discussion... no one said we had to agree at the end of it :) I'll get right on the implementation; should be ready sometime in 2010 (read that as: my lack of C skills means actually coding all this isn't feasible with the amount of time I have).


SS notes that while a lambda can be a pure value, closures can't. Lambda is a facility, but when it can carry state like closures it becomes a new powerful tool to program with a new set of abstractions. This is for instance why Jim is not using procedures as values, because it is too important to have closures (see Jim Closures), and while closures can be represented as pure values, the usual usage pattern is that after a closure is called the state may be modified. There is no obvious way to obtain this effect without the closure being handled by "name", other than something like: set closure [$closure .....], which is a hardly acceptable solution.

NEM Again, this isn't strictly true. Mutable closures are harder to do. But you can achieve non-mutable closures with a dict, as this version of apply demonstrates:

 proc apply {lambda args} {
     lassign $lambda argl body env
     for {set i 0} {$i < [llength $args]} {incr i} {
         dict set env [lindex $argl $i] [lindex $args $i]
     }
     dict with env $body
 }
 # Use a constructor
 proc lambda {arglist body {env {}}} {
     return [list $arglist $body $env]
 }
 apply [lambda {a b} { expr {$a * $b} } {a 10 b 20}] 15

You could have nested dicts etc for lexical scoping. Now, how to do mutable closures? Well, like in ML and other languages, the key is to separate naming and mutable state, and provide a new reference type. This is what Jim does, of course. However, instead of making the lambda, or the closure, mutable, you put mutable references into the environment:

 namespace eval ref {
     variable id 0

     proc create {val} {
          variable id
          variable ref[incr id] $val
          return [namespace which -variable ref$id]
     }
 }
 set account [lambda {args} {
     if {[llength $args] == 0} {
         return [set $account] 
     } else {
         incr $account [lindex $args 0] 
     }
 } [list account [ref::create 0]]]
 apply $account 12
 apply $account 15
 apply $account ;# = 27

Now, we can discuss how to add GC schemes to the ref data type, but that's pretty straightforward, and another topic.

RHS It may be straightforward, but it's costly. As noted previously, the more Tcl_Objs you have, the slower GC gets.

SS My reasoning is: if I can model anonymous procedures as pure values, I'm glad to use pure values. As I can't when procedure locals are involved, and I have to resort to some kind of name for the procedure, I lose the advantage of being able to use them by value, that is, simplicity and clear semantics. At this point I think it is just simpler to always return a handle for anonymous procedures. I'm sure it's not simple to agree on this, it's really a matter of design/taste... but this is my rationale. Without closures, using pure values is just better.


snichols When I read your title, my first impression was: Thing? This sounds like Object. Tcl does have a few objects already: the Tcl array and the Tcl dictionary. I believe the Tcl array is Tcl's only complex data type.