Version 12 of Better Arrays for Tcl9

Updated 2007-12-31 00:55:48 by dkf

(Pulled out of Tcl 9.0 WishList in order to keep that page shorter and easier to read...)


KJN - Remove arrays from Tcl, and use the name(index) syntax to mean dictionaryVariable(key). If the value of index is permitted to be a Tcl list, then it can be interpreted as dictionaryVariable(key ?key ...?). RZ Make it possible to add traces to name(index) so it can be used as textvariable in Tk.

Implement the command array (except array exists) as a (deprecated) ensemble of pure-Tcl commands to ease the migration path from 8 to 9.

There are clearly incompatibilities that prevent a dictionary variable from being a drop-in replacement for an array: hence this suggestion is for Tcl 9.

  • A dictionary variable is an ordinary variable, and can be overwritten with either a dictionary or non-dictionary value; whereas an array cannot be overwritten with a scalar value, or a scalar variable with an "array value", and any attempt to do so will throw an error. The subcommand "array exists" has no meaningful counterpart for a dictionary variable.
  • A dictionary key path of varName(foo bar) would imply the existence of key path varName(foo). This is not the case with the array index arrayName(foo bar). Also, the parser currently rejects arrayName(foo bar) if the index is specified as a literal, rather than (for example) by arrayName($index).


KJN - The discussion below was last modified in 2001, and many of the suggestions have now been implemented with the dict command in Tcl 8.5.

Better Arrays. Arrays should be objects that can be passed around and returned.

DKF - Some of this might in fact find its way into the core before 9.0; c.f. dictionary values.

MSJ - Is the idea here to use the syntax of present day arrays, but just to store the value like in Jean-Claude Wippler's Dictionary? [Do you perhaps mean Frederic Bonnet's dictionary? ] In this case I think it would be a huge improvement, as not only would one get more functionality, but the TCL "everything is a string" philosophy would now also apply to arrays (I've never really understood why arrays had the limitations in the first place). The way I understand it, the following would apply:

 set a(fruit) apple
 ==> apple
 set a(vegetable) tomato
 ==> tomato
 set a
 ==> fruit apple vegetable tomato
 set basket(contents) $a
 ==> fruit apple vegetable tomato
 set basket
 ==> contents {fruit apple vegetable tomato}

I would like to add a few requests so that slightly more complex data structires can be handled. I would like to be able to do:

 set basket(contents)(fruit) pear
 ==> pear
 set basket
 ==> contents {fruit pear vegetable tomato}

It would also be nice if lists could be easier to use than at present, e.g.:

 set xx [list apples bananas cherries]
 ==> apples bananas cherries
 set xx((1)) [list BANANAS BROCOLI]
 ==> BANANAS BROCOLI
 set xx
 ==> apples {BANANAS BROCOLI} cherries
 puts $xx((2))
 ==> cherries
 puts $xx((1))((0))
 ==> BANANAS

DKF - Maybe. The problem with doing something like this is it obscures a number of other useful functionalities (arbitrary array names come to mind) and there is code about that tests for presence, absence and/or type of variables by using [catch] which would cease to work under such a change. That isn't to say that this is a bad idea, but the side effects are quite long-reaching and may not all be obvious. I have also corrected a few minor errors in your examples above; Tcls association lists - those accepted by [array set] - are not the same as TclXs keyed lists; they have fewer layers of quoting...

MSJ - I suppose what I am saying is "in the process of reforming arrays, please don't forget arrays & lists embedded in arrays". I do realise that the double parenthesis proposal would require changing the satus of parenthesis to a reserved symbol that would need to be escaped in certain sequences much like the $ symbol. This would of course break some existing code, but would be beautiful - everything would be a string, but with the symbols $,( and ) one would be able to reduce the use of functions set, lindex, lreplace, array etc, for basic tasks such as saving/retrieving a value in an array/list. One could of course also use a less common symbol (e.g. @ or $$) instead of parenthesis, but this would not look as good. BTW, was your answer to my first question a yes with the changes that you made?

DKF - Yes. Are there any overlaps with BLTs vectors?

MSJ - Don't know. I just had another brainstorm that I think will fix the present difficulty with embedded data structures without backwards compatibility problems - One function, let's call it struct for the purposes of demonstration:

 The first argument is the variable name
 Hereafter, the arguments come in pairs
 First argument in the pair is a charcter and determines how the arguments just before and just after it are interpreted:
 A . has the same meaning as in C (the argument before is a Tcl-array (C-struct))
 An @ means that the argument before is a list and the argument after is the index
 An = means the next argument should be assigned to the previous
 % foreach idx {a b c d} {struct x . $idx = $idx$idx}
 % set x
 ==> a aa b bb c cc d dd
 % foreach idx {1 2 3 4 5} {struct x . vec @ $idx = 4$idx}
 % set x
 ==> a aa b bb c cc d dd vec {{} 41 42 43 44 45}
 % struct x . vec @ end
 ==> 45

One can of course work on allowing more complicated indices after @ characters e.g. "end+1", "2*$i+5", "1..end". Instead of ".", "@" or "=", one might also write "delete" to delete an item.

RS -- Can't the deficiency of

 set foo(x) 1
 1
 set foo
 can't read "foo": variable is array
 set foo {bar grill}
 can't set "foo": variable is array

be overcome even before 9.0 by just calling array set/get in the positions where these two errors are thrown? Performance may get lost if huge arrays are recast to lists and back, but in the object representation one bit might indicate array state, so converting $foo to a list would be needed for accesses like lindex $foo 1 .

JCW I've been pondering on whether arrays can be virtualized (for doing things like pulling data off the disk on demand, or for presenting an array layer on things like /etc/passwd, *dbm, or eh... Metakit). Right now, one can do a lot with traces to fake it, but not everything. The "array names" command won't work unless elements actually exist. Read traces can fill an array lazily but not keep the size such a cache within bounds. Write traces are fine, but there is no "raw write" to fill in data without triggering such traces. Plus, somewhat related, I see a need for "pre-traces" ... executing code before a value changes or goes away. With a virtual interface, one might even be able to move the implementation of arrays into a separate extension.

JCW Another idea is to reconsider the dichotomy between arrays and lists. Both store collections, but one of them is indexable via ints, and the other via a more general key value. Lua has generalized this into the wrong direction IMNSHO, and stores lists in arrays ("tables") - albeit very efficiently due to some clever ideas. The opposite direction would seem much more general: lists store aggregate data, with arrays being a special case. Special in two ways: the odd-even format is an obvious (and natural, in Tcl) way to create an equivalence. The other trick would be to use a secondary/hidden list for hashing / buckets / overflow. This does not necessarily slow down things (quite the opposite in fact, I expect). What it does mean is that fast/hashed access is a property one can slap onto any pairwise list. FWIW, I have started doing this in Metakit with good results so far. It has an extra benefit that the hashing information becomes explicit (a bunch of ints, which can be stored like everything else). Hm... perhaps I'm not making too much sense with all this. It's a bit cryptic, I'll think about a way to describe this more clearly...

RS Let me see if I understand you right. We start from even-elemented lists (2 is even too):

 set x {}
 lappend x John 123    ;# set y(John) 123
 lappend x Mary 456    ;# set y(Mary) 456
 @ x John 234          ;# overwrite: set y(John) 123
 @ x Mary => 456       ;# set y(Mary), or: $y(Mary) 
 @ x => {John Mary}    ;# array names y
 expr {[llength $x]/2} ;# array size y

What I don't see yet is how to do hashing with a (hidden) list - given a key to search, lsearch with check on odd/even index would be required in both cases. The advantage of lsearch however is that you can search by key and value, which I've used in some translation tables (e.g. Morse en/decoder). But to replace hashtables with lists in equal efficiency seems difficult to me, and on the C side they're a fact of Tcl life anyway..

JCW Nice interface! After I wrote that, it occurred to me that I could actually code this up and have a go it it (trace-less, but mutable and storable). Ok let me try to describe things further...

  • let's call this thing a map for now
  • the internal rep of a map is two lists
  • one list is the pairwise key/value contents
  • the other is pairwise too, but hashcode/index
  • the size of the second list is larger, it has spare slots
  • lookup hashes the key, looks it up in 2nd, then finds index
  • when the 2nd list fills, it gets re-sized (in prime-size chunks)
  • when its use drops below 2/3's, it is resized down again
  • the implementation I use is somewhat derived from Python's

I'm omitting several details, but this thing works quite nicely in MK (it's only just now starting to get used, and banged on). It's fast (some Tcl'ers find it shocking to hear that there might be more efficient implementations than theirs, so I won't say that), though it's going to be hard to compare with full arrays, also due to the lack of traces, and the fact that upvars and present-but-seen-as-not-set items don't work on such a beast anyway.

But there's a nasty issue lurking in Tcl's current design. Right now, one can have lists with an efficient dual representation underneath, but one cannot mimic the behavior with other datatypes. Example: doing "llength $x" would kill the map. Oh, it could be set up to return the correct result, but after that it's a list object, and recreating a map from it is going to require a complete recalc of the second list. There is no way to make llength work, other than by altering it to peek inside and see if it is a map, then determine its length. Or by creating a "mlength", which casts to a map, and properly guards its representation. But that's not very OO: I want the length of something, in a polymorphic way - let the object itself determine how that gets calculated. Unfortunately, "inherent object type" is not a concept that fits in well with Tcl: is the length of "1 2 3" supposed to be 3, or 5?

A day later... see Adding a hashed datatype -jcw

AK: What Jean-Claude desires in the last paragraph is something that Paul Duffin called 'interfaces'. Where an object may declare that it supports the FOO interface despite not actually being a FOO. In the case above the map could declare to support the list interface.