Google definition of the term here: [http://www.google.com/search?hl=en&lr=&ie=UTF-8&oi=defmore&q=define:abstract+data+type]. This page could be entitled "''An argument for OO as composition of Abstract Data Types''" [LV] I presume that an abstract data type in tcl is anything in contrast to the [string] data type that Tcl has? For instance, is a [list] an abstract data type? At the level of the tcl language [string] (together with all the commands which manipulate strings) is an abstract data type, since it defines a type of data which you can operate on using the string interface commands, but you can't tell from the tcl language whether it's null-terminated or length-prepended. [List] (together with [lappend], [linsert], [lreplace], etc.) is an abstract data type for the same reason. [CMCc] I'm going with this definition "''A data type together with all the operations on data objects of that type that make use of information about the representation of the data type''" [Array] is an abstract data type, since you can access its contents with [array get] or using (), but you can't tinker with the hash collision tables which underlie the array representation. Array presents a good example of some of the benefits of encapsulation (which is implied by the ''abstract'' in ''Abstract Data Type''): * you can't change the internal representation of an array in such a way as to break it. * it has a clearly defined interface, represented by the subcommands of array * someone can change the implementation of array, without changing its behavior Each of the other builtin data structures, and many of those in [tcllib] have these benefits too. However, array and (say) ::[struct::tree] suffer from an inability to be composed together with other data types into more complex abstract data types. For example, one can't derive a SortedArray, which always returns its [array get] in a predefined order. Native array doesn't do this because it's unnecessary for most uses and it would be expensive but there are cases where one might like this: for example if one was using an array to store a time sequence of events. The ability to derive SortedArray (e.g.) from builtin Array in a coherent and consistent manner, while preserving something of the encapsulation of Array, would be a benefit of OO. How many ways are there to wrap Array in something which adds/changes its functionality? An infinite number. There should also be a single, efficient, simple, predictable, well known way. That, I think, is what the adherents of [OOTcl] are suggesting. I agree with them. ---- [RS] has never felt a need for "SortedArray" - it's easy enough to iterate like this: foreach name [lsort [array names arr]] {...} Types/subtypes/classes/derived classes are popular in other languages, but doing without them, I just feel freer in Tcl... :) [CMCc] - I presume, [RS] that you've never had more than a few instances of SortedArray and discovered that you had to change the sort order ... in OO one changes it once, in the solution proposed, it needs changing everywhere it appears. As there are an infinite number of ways to wrap functionality and compose simpler objects programmatically, and the only metric of quality being admitted here is performance, it would seem very difficult for a maintainer to recognise it when it's being done. [AMG] The [Sorted Lists] page has some related information. ---- [Lars H]: As remarked in the lengthy discussion on the [Things holding Tcl back] page, it is possible to encode data structures on top of simpler data structures, and these on top of still simpler data structures, in rather many levels. Tcl has a marvellous way of encoding [list]s as strings, but stops there, as [everything is a string]. Most computational models (e.g. [Turing machine]s) stop there as well, although some (e.g. recursively defined functions) continue by encoding strings as Gödel numbers. That something ''can'' be encoded using simpler structures is not of much interest here. [CMCc]: yes, if one replaces the question "''what are important/desirable characteristics of composition in a language?''" to the question "''is it possible to compose in [[some given]] language?''", it does lose value. [Lars H]: Given a range of implementations of a data structure, the right question is probably rather what the computational complexities of standard operations are for these implementations. If one can find an example of a data structure where Tcl implementations have significantly worse complexity than other languages, then that is a sign that the language could do with improvements. Some historical examples: '''Lists.''' <
> In Tcl 7, there were no [Tcl_Obj]s, and thus in particular, lists had no "internal representation". Any operation on a list required that the string was reparsed, so that for example [llength] and [lindex] (on average) was linear time operations. In particular, the complexity would depend on how large (as strings) the elements of the list were, which was a significant drawback in comparison with e.g. [Lisp]. In Tcl 8, both [llength] and [lindex] are constant time operations on values that have a list internal representation (and in these circumstances it is reasonable to assume that list values are only being operated on using list operations). Now it is rather Lisp (using a linked list implementations) that is at a disadvantage, because computing the length of a linked list requires that the list is traversed. (OTOH, insertion is linear in list length in Tcl, but can be faster in Lisp.) '''Dictionaries.''' <
> In Tcl 8.4, getting and setting individual elements of a dictionary (stored as a "key value key value ..." list) is linear in the size of that dictionary, but in Tcl 8.5 the new dictionary values with a hash internal representation will be faster (anyone know how much faster? Logarithmic?). In these two cases, there was a clear gain with getting these things into Tcl, but for many of the things proposed, there is no gain! Lists are all that one needs. '''Fancier data structures aren't always better''' In the [reference] page (or thereabout) I asked whether there was ''any case at all'' of a data structure that could be implemented more efficiently (better complexity) using references (e.g. Lisp cons cells or C pointers to structs) than in current Tcl. The simplest example anyone gave was that of a dual-sorted queue: If the queue is a list then you need to search through it to find the smallest element in at least one of the sorting orders, so dequeuing is a linear time operation. If the queue is instead implemented as a linked list, where there is one "next1" reference for sorting order 1 and one "next2" reference for sorting order 2, then it is possible to find and dequeue the first element according to either sorting order in constant time. You do need to make both linked list structures dual-linked, so there is a total of four pointers per queue element, but it is still constant time. There is a "but" though: that is just the complexity of dequeuing. Any kind of system that makes use of such a queue would have to enqueue as many elements as it dequeues, and inserting a new element on the proper position into a linked list is still a linear time operation. So in the end there was no gain, and total complexity is the same regardless of whether one uses reference-based linked lists or just plain Tcl lists. Then again, it is possible to speed up insertion in the linked list case by adding a self-balancing tree structure to the thing, in which the original linked list is just the "thread" (I think it is called); this gets insertion down to logarithmic time, at the price of increasing the total number of references by another factor two (possibly more). The result is a rather huge data structure, but it is faster than simple-minded searching through a list. On the other hand, it still isn't superior to Tcl, because nothing stops us from treating a Tcl list as one in the old days sometimes treated an Pascal array as a "heap" from which we can allocate and deallocate our own dynamic memory ("pointers" then being indices into this list). Anything that can be done with references (without automagic garbage collection) can be simulated in Tcl using lists, without loss of asymptotic performance. [CMCc] '''Who said the sole metric of "better" is "more efficient"?''' Lars gives a few examples of completely encapsulated objects which benefited from a reimplementation without a change in interface. The ability to do that, to distinguish between implementation and interface is why something which permits encapsulation is "better" than something which does not. If it were not for the fact that the internal representation (and implementation) was distinct from its interface, it may well have been practically impossible to produce the more efficient implementation without disrupting users of the construct. For such a distinction to be significant, it has to be distinct and recognisable by most people. That's one of the things OO does - signals what's available for use and what's implementational, and subject to change. ''[[[DKF]: what is meant by "implementational"? Replace with real meaning and delete this comment.]]'' [Lars H]: Your claim that my examples concern "completely encapsulated objects" is incorrect. Lists were not completely encapsulated in Tcl 7 (they could at any time be introspected using e.g. [[string index]]) and they are not completely encapsulated in Tcl 8 either (you can still peek at characters using [[string index]]). It is ''inefficient'' to operate on them using non-list commands, but it is perfectly possible. Come to think of it, this may be what object-oriented programmers have hard to understand about [everything is a string] (or at least ''one'' thing that makes them confused): A good data structure need ''not'' necessary hide the data behind some opaque API -- it is sufficient that the public representation (the string) of the data can be lazily generated. Thinking of data structures as objects instead promotes the view that as little as possible should be made public. ((interpolated rebuttal: [CMCc] - I take it amiss when I'm told what I don't understand. Recently, on the wiki, I started talking about things which aren't strings ([proc]s, arrays, etc) and Miguel was kind enough to give me the full version: everything is representable by a string. Every value is serializable and constructable to a string. Additionally, list (which is the example given by [Lars H]) is implemented as an array of pointers in C, but could be implemented as a linked list (for example) and this would not change its interface. List implementation '''is''' completely hidden from the user, as are all of the other basic data types. If it were not so, it would be possible to write code which broke when the implementation changed from the current array-of-pointers to linked-list. It is not possible.)) [NEM] I've been giving this some thought recently -- the seeming contrast between transparent values and opaque objects/values. Tcl's built in list implementation is based on a string representation, but if I was to write a new list implementation using one of the many [OO] packages, it would be an opaque handle. The reason for this opaque handle is that in order to support the OO [[object method args]] notation (and the polymorphic method dispatch this allows) the "object/value" must be a command name. It would thus be pretty inconvenient to change this command name every time the data changed -- not to mention the risk of name clashes. This has implications for how the value is treated compared to built-in [list]s/[dict]s etc. Values in Tcl are often said to be typed according to how command treat them (there is no list ''type'', merely some commands which attempt to treat a subset of strings as lists). However, OO extensions reverse this by making the value itself a command name, hiding the actual data behind the scenes, and laying down which commands can be applied to the value. So, in that sense, a certain amount of typing has been introduced by these extensions. Is this a bad thing? OO advocates would argue that it is a very good thing -- hiding implementation details behind an interface make codes maintenance easier, and helps robust code reuse. I can certainly see the logic in this, and I've begun to wonder whether transparent data types are actually a good idea. We've all seen the problems that people get into with using string functions on lists, and it seems advice generally given these days on how to get good performance, and best use out of values in Tcl is to not mix commands which expect different "types" - always use list functions on lists, try to avoid shimmering, etc. So, on the one hand we have the argument that transparent values make Tcl so powerful, and you can do wonderful things with them, and on the other, we have the wisdom; '''Don't Do That'''. Note, also, that there is a confusion between the terms "[everything is a string]" and "transparent data types". Opaque data-types (handles) are still represented as a string, but their internal data structure need not be (could be implemented in C) convertible to a string. So, at the Tcl level, every value you see is still a string, just that sometimes that isn't the whole story. This is not a new problem introduced by OO extensions - [chan]nels, [image]s, [namespace]s, commands themselves and more are all opaque handles. Many provide introspection capabilities to allow serialisation, but some do not, and generally none do so in a consistent manner. So, what are the advantages of transparent data types? Natural serialisation is one: just write the value to a file and read it in later and you can exactly recreate it - no explicit serialisation/deserialisation to perform. I don't see this as particularly convincing argument. You could just as easily make "everything an object", and just mandate that each object provides routines to convert to/from a string -- then the input/output commands simply call these methods as needed. What are the advantages of making the implementation details of a value (the string representation) a part of the interface (indeed, the only interface)? Is Tcl unique in doing this? (''Discussion moved to bottom of page - marked with ***'') [NEM] Ok, I'm going to retract that statement about serialisation - it's not quite as easy in the OO approach as I made out. The problem comes from things like: puts "This is my object: $obj" Here, you need some way of finding the reference to $obj and replacing it with [[$obj tostring]] transparently. Not impossible: you could do the serialisation in the variable dereferencing code (rather than in a channel handler), but this breeds more problems: set obj2 $obj $obj is serialised and then deserialised, rather than the opaque handle being copied - so what is the "type" of the new object? We have returned to a transparent value here, basically. This is a much more interesting problem than I had thought - and has rekindled my appreciation of transparent values. I need to think about this some more... [NEM] I just had an idea too: has anyone ever tried to combine the two approaches? I mean, create an OO package where the string-rep was something like: [list dispatch-cmd data...] (([CMCc] check out [namespace] ensemble for something along those lines - it makes a namespace a command which can be used to dispatch other stuff)) dispatch-cmd could be a command for the "class" of the object (assuming class-based OO), and the data bit could be a transparent representation of the data contained in the object (say a dict). Would require auto-expansion of leading word to work... Just an idea. [NEM] Looking at this idea again, I see it has a problem of this prefix stuff. For instance, it still doesn't solve the problem of being able to use an OO toolkit to write a compatible replacement for Tcl's built in lists - e.g. something that could be used with [foreach] transparently. [NEM] *** Moved this to the bottom for people skimming for changes: ''[Lars H], later insertion: The serialisation/deserialisation issue may not sound like much in a theoretical discussion, but if you frequently communicate structured data then it is a fantastic feature! The practical advantage is HUGE.'' [NEM] (later still): Indeed - the practical benefits of such simple serialisation are well known, and very real. However, I'm just trying to think about what the trade-offs are. It seems that to have this natural serialisation (and powerful introspection) we have discarded most of the benefits of encapsulation - we have laid bare the very implementation of the data structure. For example, suppose if at some point somebody (a madman clearly, but anyway) decided to change the string format of lists (let's say they decide items will be separated by commas instead of whitespace, for easy production of [csv] files). Now if you send a list from a current Tcl, and try to read it into this new whizz-bang modern Tcl, it will probably fail (or at least, it is unlikely to reproduce the same list). There is no versioning in the string representation, either (but, of course, there is versioning in the Tcl package which implements it). So, we enforce encapsulation through a lot of saying (and documenting) "Don't Do That". Is this a good situation? Are the trade-offs acceptable? Is there a better way? Is there a way to say which data-types would be better as transparent, and which would be better as (encapsulated) opaque handles? And is relying on natural built-in serialisation a good idea for communication - should you really be designing/using a special protocol for it? [NEM] See also [transparent] and [handle] pages which discuss these two representations. Basically, transparent values are first-class, while handles are more like pointers or references (the handle is first-class, but the thing it ''points'' to is not). [FB]: I'm trying to address the [transparent] vs [handle] dichotomy in [Cloverfield]. Transparent values are IMHO one of the major strengths of Tcl. However there are cases where this falls short, because Tcl lists can easily represent trees but not graphs. IOW, graphs cannot be expressed as a single transparent value in Tcl without in-band signaling (which is considered A Bad Thing most of the time in CS, see [http://c2.com/cgi/wiki?InBandSignal] vs. [http://c2.com/cgi/wiki?OutOfBand], return codes vs. exceptions, zero or {} vs. null, etc.), for example by enforcing some naming convention. But in this case you add some computational complexity, or are forced to supplement the value with external data. Let's consider the following graph: ====== A--B--C |\ | / | \|/ D--E--F ====== A possible Tcl definition would be: ====== { A {B D E} B {A C E} C {B E} D {A E} E {A B C D F} F {E} } ====== Here the graph is expressed as a dictionary whose keys are the node names and the values are lists of connected nodes. This case is simple because of the naming convention. Now let's imagine that we want to allow changing the name of a node. This implies changing the key as well as all instances of the name in the lists. To prevent that we have to choose another convention by giving each node a constant identifier: ====== { 1 {A {2 4 5}} 2 {B {1 3 5}} 3 {C {2 5}} 4 {D {1 5}} 5 {1 2 3 4 6}} 6 {F {5}} } ====== Here we uncoupled the name from the identifier, and stored the latter in the lists. That means that we are now free to change the names of the nodes, but we can no longer use this name as an index into this structure, we are forced to use arbitrary identifiers as dictionary keys. To keep this possibility, we need to maintain a supplementary structure, here a reverse map. Moreover the identifiers are totally meaningless with respect to the data held by the structure. Identifiers should be hidden data, but they must be made transparent and explicitly used. This is a kind of in-band signaling. The other alternative is to use handles. Handles are the Tcl equivalent of pointers. Each node becomes a distinct data identified by its handle. Accessing node data needs support from specific procs, or must be properly encapsulated into an object that provide access methods. Node data can be easily changed without altering the overall structure. But what we gain on one side is lost on the other side, as the structure can no longer be expressed as a string. And we have shifted the paradigm from a declarative to an imperative style. Addressing this kind of problem needs out-of-band signalling. We need a way to express references in transparent Tcl values so that data and relationships can be uncoupled. To do so [Cloverfield] introduces the concept of word modifier. Words (in the Tcl sense) can be tagged by special modifiers using a syntax similar to the argument expansion operator. In our case, one can associate a reference identifier to a given word, and later refer back to this reference in the same structure (see also [Cloverfield], section '''References'''). This uses the prefix `{ref id}`, where `id` is an arbitrary string. Let's define our graph using Cloverfield syntax: ====== ( {ref 1}A ({ref 2}B {ref 4}D {ref 5}E) {ref 2}B ({ref 1}A {ref 3}C {ref 5}E) {ref 3}C ({ref 2}B {ref 5}E) {ref 4}D ({ref 1}A {ref 5}E) {ref 5}E ({ref 1}A {ref 2}B {ref 3}C {ref 4}D {ref 6}F) {ref 6}F ({ref 5}E) ) ====== Here each occurrence of node names are prefixed by a reference identifier (for comparison purpose we chose to reuse the same identifier values as above). These tags tell the interpreter that the associated data can be referenced later in the structure. Value wise, the above list is strictly equal to the first Tcl version above: the structure is still a dictionary that can be indexed by node names, and the lists of connected nodes hold references to this names rather than value, however `[foreach]` will only "see" the names; changing a node name doesn't break this relationship, because there is no longer an explicit naming convention, and the purposed served by the identifiers in the second Tcl version is now expressed as metadata. This is out-of-band signalling, and it allows keeping all meaningful data in a single structure that can be expressed as a string. For this example I've simply taken the first Tcl version and prefixed each name with its matching ID. ''(Note: There are better ways to express graphs using this syntax but I chose this one for comparison purpose)''. This mechanism is not really meant for direct use in scripts (although its usage is fairly easy), but rather to enable the serialization of circular referencing structures that can't be expressed as trees. For example an application could build a list with complex reference patterns, such as Scheme-style "infinite" lists, and not worry about its string representation, whereas such structures would typically be impossible to create from Tcl, and generate an infinite loop when building the string representation of a self-referencing list created from C (which nothing prevents in the current version). Tcl is meant to be a glue language, and to do so data exchange must be expressed as strings and formatted following specific conventions. However some structures simply cannot be expressed that way without a lot of hassle, and this proposal tries to address this problem. I think the lack of expressive power in languages is one of the major reasons why [OOP] is so popular. In the case of XML, compare for example [DOM]-based access (imperative) versus [XPath] queries (declarative): large portions of the Document Object Model become almost totally useless because a single XPath query can replace a complex sequence of DOM statements. And [SQL], while still free of any OO concept, remains the best way to access relational databases, because no sane person wants to express database relationships in an imperative language. [DKF]: The traditional Tcl way of handling such complex structures is (regardless of the opacity of the value) to provide a command (or commands) that talks the users' language and performs the operations on whatever the structure actually is. I encourage you to stick to that paradigm (however it works in practice when mapped to what you're doing). ---- [NEM] The problem with lists is that there are two levels of implementation - the implementation of the efficient internal representation behind the scenes (by the list Tcl_ObjType and associated gubbins--''like this[http://en.wikipedia.org/wiki/Gubbins_band]?''), and the implementation of the conversion from the string representation to this internal representation. You are right, in that the ''internal'' internal representation is completely encapsulated, and could be changed to a linked-list without breaking anything. I was referring, however, to the implementation in terms of string representation, which certainly would break things if it changed. It's this representation which is more important at the script level, as it is the only representation which is has any level of guarantee of surviving shimmering etc. So the string rep is an encoded form of (part of) the interface which the underlying implementation has to honour. Does that make sense? [CMCc] - Makes sense to me. It's a shame [Tcl_Obj] wasn't called something different to save confusion. Yes, struct Tcl_Obj defines ''objects'' at the C language level, but it's an impoverished/minimalist object interface which, by design, only concerns itself with construction and conversion to/from a string representation. Nothing wrong with that, but it hardly impacts on the visible tcl language at all (if at all) which is evidenced by the way we think about [shimmering]. The semantics of conversion to and from string representation is the explicit concern of Tcl_Obj, it's the C interface defined by Tcl_Obj, and Tcl_Obj is (by design) silent with regard to other interface functionality and is in any case invisible to the Tcl language level. The only Tcl interfaces to things supported by Tcl_Obj are command wrappers. Several significant entities in Tcl aren't even represented by Tcl_Obj (e.g. open files) and several of the significant entities which are represented by Tcl_Obj aren't visible to the Tcl language anyway. [DKF]: They should have been ''Tcl_Value'' IMO, but that datatype was taken (for user-defined functions in [expr], since you ask). [FB]: I'd suggest ''Tcl_Word''. The [Dodekalogue] designates values as words. [DKF]: That'd probably get confused with 'machine word'. Sometimes you're just stuck. [FB]: I think I could live with that confusion ;-). Confusion with objects is more problematic IMHO. [NEM] 2008-01-24: ''Tcl_Term''? ---- !!!!!! %| [Category Data Structure] |% !!!!!!