---- Title: Complex data structures **Problems** [Arjen Markus] Quite often the question arises: how to model complex data structures? As Tcl allows various solutions, it seemed to me a good idea to start a new page on this subject. ''[escargo] 17 Jan 2003'' - Just for the sake of illustration, could you give some examples of these ''complex data structures''? Are we trying to talk about AVL trees, red-black trees, [trie]s, DAGs, petri nets, bags, or what all else? [AM] (24 january 2003) When I set up this page, I was thinking of C [struct]s, Fortran [array]s and other "passive" data structures. Note that the [Tcllib] contains a number of modules for managing data structures - [matrix], [tree], ... [awalczak] I deal with tcl code generated by idl2tcl. In my case IDL code is modeling struct with other structs and collections inside. I would be very thankfull for example how to build in TCL a big struct-like construct step-by-step (having smaller units defined earlier and combining them together before return form a function). [RBTree] is a Tcl extension that implements Red-Black trees, having features of an ordered list and direct access as array. - [RS] found explanations at [http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/red_black.html], but from there I gather it's just well-balanced binary trees, and on insertion are (by possible rotation) brought back to well-balancedness. [DRH] - Complex data structures can be modeled using an in-memory [SQLite] database. (Or use a standard disk-resident database if persistence is important.) Think of each table as a class and each row as a structure instance or object. With this in mind, it is trivial to build arrays, queues, stacks, or graphs of objects. ''[escargo] 23 Sep 2005'' - Possible, yes, but not necessarily efficient. If what I want is a Tclish version of [Judy arrays] [http://judy.sourceforge.net/], or tries ([http://www.nist.gov/dads/HTML/trie.html], [http://www.webreference.com/js/tips/000318.html], [http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/]), or PATRICIA tree [http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/], I lose all the specialized performance advantages if I have to build them out of the heavy plumbing of an in-memory database. [DRH] replys: escargo seems to confuse abstract data structures (ex: arrays, bags, DAGs, queues, etc) with their implementations (red/black trees, tries, hash tables, linked lists, etc). The Zen of Tcl is that you don't worry about the details of implementation, you focus instead on the higher-level aspects of the problem. An in-memory SQL database provides O(logN) search and O(N) traversal of data using a variety of underlying algorithms. The specific algorithms used (B-tree, hashing, linked lists, etc) are irrelevant to the programmer using SQL. All she cares about is that the results come up quickly. And an SQL database provides the same asymptotic performance as the exotic data structures mentioned by escargo above. The exotic data structures may be faster by a constant factor in certain scenarios. But generally speaking, that constant is measured in percent. And the speedup only applies for narrowly defined problems. If performance is so critical that I am willing to write intricate, customized code to achieve a speedup of 20 or 30%, then I'm going to code the problem in C, not in TCL. For 95% of the things that I and most other people do, where ease and generality of implementation and simplicity of maintenance are the driving concerns, then a generalized abstraction such TCL's built-in arrays or an in-memory SQL database is a far better choice than trying to employ an exotic data storage and retrieval algorithm. ''[escargo]'' - I can't program with abstract data structures; I need to use their implementations. (And I do understand the differences between policy and mechanism, too.) I never said that the data structures would be implemented in Tcl. I fully expect that the implementations would be in [C] (or C++) with a wrapper that allows Tcl to use them. An in-memory SQL database seems far more exotic to me than Tcl access to an existing C library for Judy arrays, etc. I certainly believe that the KISS principle applies. That '''constant k''' that differentiates performance between an ideal implementation of a complex data structure and one layered on an ''in-memory SQL database'' can be significant. And there are times when I have a program that is in the 5% where such things matter. (Also, I'm aware that writing a simpler but naive implementation and profiling it to see where it's slow is better than picking a complex algorithm just to speed up part of a program guessing that you know ahead of time where the bottlenecks will be.) ''[L]'' - I agree wholeheartedly with escargo. It is nice in many cases to ignore the "details of implementation and focus on the higher-level aspects of the problem". However, sometimes this is impossible. Example: I have a 15GB text file that I want to index by creating a trie of all the words in it to record at which offset in the file this word first occurs. Because I know that the same word will be repeated a million times ''[kanryu] 20080608'' - huddle is a complex container with abstract container algorithms (such as dict and list) which can contain multirank nodes. ---- **Possible solutions** ordered roughly after how "exotic" they are, taking into account requirements, user base, and history: * Use [array]s (associative arrays allow indices that are strings, rather than mere numbers or [tuple]s of numbers) * Use [dict]s (quite similar functionally to arrays) * Use tagged lists * Use [namespace]s * Use key–value lists * Use struct::tree (in tcllib) * Use [TclX] [keyed list]s * Use one of the object-oriented extensions [Larry Smith] 6/8/2008: [Lost] and [Most] are very compact, likely solutions to this problem since their overhead for object-orientation is minimal. * Use [Records] [Larry Smith] 6/8/2008 * Use [huddle] (tagged list, dict, or other containers)s ---- ''Pros and cons of the object-oriented extensions:'' * Pro: Natural and intuitive to anybody who has done any OO. * Con: Not really a solution by itself (unless all one wants is a C-style struct), but rather a framework within which solutions can be more elegantly implemented. * Con: At the moment garbage collection can't be applied to these objects so you need to take care of destruction yourself. * Con: Objects are not first-class data items. * Con: Requires an external package. ---- ''Pros and cons of TclX keyed lists:'' * Pro: a [keyed list] can handle nested data structures. For example, the keyed list "{ID 106} {NAME {{FIRST Frank} {LAST Zappa}}}" represents the following hierarchical structure: ID = 106 NAME: FIRST = Frank LAST = Zappa * Pro: keyed lists are first-class data items. They can be passed in and out of procs like any other list or string. * Pro: it is possible to query the existence and value of a key in a single command, without the speed drawback of [catch]. * Con: the access syntax is more verbose than for arrays. * Con: access to nested elements requires multiple commands. * Con: requires an external package. ---- ''Pros and cons of arrays:'' * Pro: by one name, you gather all the pieces of information set personal_data($name,birthday) "..." set personal_data($name,telnum) "..." ... * Pro: easy access to all the data pieces. * Pro: implemented internally by hash tables, hence access is fast. * Con: arrays are not "first-class" variables in Tcl, you will have to use [upvar] or [array] get and set commands to transfer them between procedures. * Con: essentially only one level of structuring. Otherwise you would have to use structured indices to store the values. ---- ''Pros and cons of dicts:'' * Pro: in one value, you gather all the pieces of information set personal_data [dict create name [dict create FIRST "Frank" LAST "Zappa"] ID 106] dict set personal_data telnum $phone dict get personal_data name FIRST ; # => Frank ... * Pro: easy access to all the data pieces. * Pro: implemented internally by hash tables, hence access is fast. * Pro: dicts are "first-class" data items in Tcl, you can freely pass any dict value to any proc * Pro: support for multiple levels of structuring, by putting dicts in dicts. * Con: It is available in an unstable release (8.5a4 currently) ''Also available for 8.4 as a compiled extension [[link, anyone?]], and not too hard to emulate in pure Tcl either -- see [forward-compatible dict]'' ---- ''Pros and cons of tagged lists:'' * Pro: tagged lists {NAME ... {BIRTHDAY ...} {TELNUM ...} ...} are simple to use, they are first-class variables (they are simply a form of list variables). They can have any level of substructure. * Pro: the format lends itself to the [data is code] style of processing. * Con: replacing values deep inside such a structure can be a nuisance. (But this may force you to use constructor and accessor procs to do this - nice modular design). * Con: because of the need for linear or binary searching, tagged lists will be slower than arrays for large data structures. The comparison on this page is in a sense unfair to tagged lists, since it focuses on table-style data structures. While tagged lists ''can'' be used to implement such things, their primary use-case is rather for tree-like data structures. ---- ''Pros and cons of namespaces'': * Pro: separate variables into independent groups, all variables are first-class. No special syntax required. * Pro: implemented internally with hash tables, for O(1) access. * Con: if you need many namespaces it can be difficult to maintain the application. * Con: variables will exist independent of any procedure, so they have a non-local scope. Again, maintenance problems will arise. ''What do you mean by namespaces here? I can think of two uses: One simply as a way to group a bunch of variables, and the other with included procs to act more like an object. Perhaps both approaches need entries?'' ---- ''Pros and cons of key–value lists:'' A '''key–value''' list is a list where every even index item is a "key" (arbitrary identifier string) and the following (odd index) item is the value associated with that key. * Pro: a key–value list can handle nested data structures. For example, the keyed list "ID 106 NAME {FIRST Frank LAST Zappa}" represents the following hierarchical structure: ID = 106 NAME: FIRST = Frank LAST = Zappa * Pro: key–value lists are first-class data items. They can be passed in and out of procs like any other list or string. * Pro: key–value lists do not require any Tcl extensions. * Pro: The '''[[array get]]''' and '''[[array set]]''' subcommands use key–value lists natively. * Pro: Forward compatible with [dict]s (key–value lists have the same string representation as the corresponding dict). * Con: before Tcl 8.5, there were '''no''' commands for directly accessing data stored in key–value lists. The most convenient way of accessing information in a key–value list is to unroll it into an array. (In practice, key–value lists spend a lot of time unrolled into arrays.) Typical code might look like proc update_KVL {KVL some more args} { array set A $KVL # Do the update return [array get A] } It is often convenient to use a key–value list for storing "the settings". Consider proc do_something {settings_KVL args} { # Store defaults into array A array set A $settings_KVL; # Override with specific settings # Use data in A when doing something } 12may03 [jcw] - FWIW, there's a little "ihash" package in [critlib] coded in C (and I have a minimal pure Tcl script for doing the same somewhere), which makes key-value lists (which I used to call interleaved lists) in the same performance order as arrays, because it maintains a hash table alongside. This is transparent, but you have to be careful to avoid shimmering or things will slow down as the hash table gets rebuilt way too often. Probably also some more info, search for ihash on the wiki. ---- Of course it is possible to combine these solutions - to get the best (or worst) out of them. ---- [Ken ]Is there other ways in create C-like structure besides TclX keyed lists; for example like the "record" below indicated by Brett. Regarding the solution you given, can you kindly explain how to get it working cause when i tried the code, it failed to recognize the package. And when i tried to use TclX keyed lists, tclsh8.4 fail to recognize the commands. How to get TclX working under Tcl? And is TclX part of Tcl? if yes, why is it separate? Sorry new to this language. ''[Lars H]: [Tclx] is a [package]. You need to do '''package require Tclx''' to load it. But [dict]s are probably a better choice than keyed lists nowadays.'' [Ken ] Another question: How do i create an array of structures? Is it there any way of doing it besides manually naming it directly for example: set ArrayStructure(0.value) 1 set ArrayStructure(1.value) 2 [Lars H]: Depends on how you implement your "structures". Maybe [C struct is Tcl!] will give you some ideas? But if you're new to Tcl then you're likely to be thinking in too [C]-ish terms about these matters. Perhaps a list of structs would suffice as good? Also take a look at the [struct] package. ---- [RS] 2002-07-05 - In the [Tcl chatroom] we discussed the potential of nested lists for data storage, given the new "multi-dimensional" [lindex] and [lset] starting from 8.4. [Steve Landers] wished for structures so list elements could be addressed with meaningful strings rather than positional indices only. Here's a nice minimal solution that I came up with: proc struct {name elements} {interp alias {} $name {} lsearch $elements} #... used to define a "struct" (map from names to positions): struct staff-> {firstname lastname birthdate phone} #... used to abstract such details in coding accesses: lset database $employeeID [staff-> phone] 123-4567890 Letting the struct name end in "->" is cute sugar. Note however that after it (and not before it) there must be whitespace. ''Nother way out would be to use "staff:" -[jcw]'' [RS] Yep - see [Implementing enumerated types] for a two-way map that generates both "staff:" and "staff@" for the inverse direction. [Brett Schwarz] - humm, I just found this page. I actually put a package in tcllib that does what [RS] is talking about (essentially). It is called '[record]', and is located under the 'struct' module in tcllib. It is only in CVS now (as of Jan 2003), but should go into next tcllib release. Here is how you would use it: package require struct namespace import struct::* # define records record phones { home work cell } record addr { name street {country USA} {record phones phlist} } # instantiate addr inst1 # assign members inst1 configure -name Brett -street main # get a values (returns list) inst1 cget -name -street # you also use the dot notation set name [inst1.name] inst1.street = Main # the phlist is a nested record, and is accesses like this: inst1 configure -phlist.home 425-555-1212 inst1.phlist.work = 206-555-1212 # following would return a list of the phones (home work cell) set phones [inst1.phlist] Note the 'country' member. I assigned it a default value of "USA". Anyways, you get the picture. See the docs in CVS for more info... ---- [George Peter Staplin] I couldn't resist writing an example of doing that with my [megapkg] structure extension. load ./structure.so proc phone {} { set p [structure] # Initialize the phone record $p -home "" -work "" -cell "" return $p } proc addr {} { set addr [structure] $addr -name "" -street "" -country USA -phoneNumbers [phone] return $addr } set addr [addr] # This prevents setting new keys by accident lock.structure.creation $addr $addr -name "John Doe" -street "Logan" [$addr -phoneNumbers] -home 555-1234 -work 555-7812 puts "The person's name is [$addr -name]. He lives on [$addr -street]. His work number is [[$addr -phoneNumbers] -work]." The - in a key is just some added sugar. Any data can be used for a key (even binary data). ---- ''[escargo] 30 Aug 2005'' - Are there any extensions or packages for implementing a ''trie'' in [Tcl]? (I didn't say ''tries'' because that's too easy to hit when searching.) ---- [unless] ---- !!!!!! %| [Category Data Structure] |% !!!!!!