Title: Complex data structures

**Problems**

[Arjen Markus] Quite often the question arises: how to model complex data structures? As Tcl allows various solutions,
it seemed to me a good idea to start a new page on this subject.

''[escargo] 17 Jan 2003'' - Just for the sake of illustration, could you give some examples
of these ''complex data structures''?  Are we trying to talk about AVL trees, red-black trees,
[trie]s, DAGs, petri nets, bags, or what all else?

[AM] (24 january 2003) When I set up this page, I was thinking of C [struct]s, Fortran [array]s and other "passive" data structures. Note that the [Tcllib] contains a number of modules for managing data structures - [matrix], [tree], ...


[awalczak] I deal with tcl code generated by idl2tcl. In my case IDL code is modeling struct with other structs and collections inside. I would be very thankfull for example how to build in TCL a big struct-like construct  step-by-step (having smaller units defined earlier and combining them together before return form a function).

[RBTree] is a Tcl extension that implements Red-Black trees, having features of an ordered list and direct access as array. - [RS] found explanations at [http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/red_black.html], but from there I gather it's just well-balanced binary trees, and on insertion are (by possible rotation) brought back to well-balancedness.

[DRH] - Complex data structures can be modeled using an in-memory [SQLite] database.  (Or use a standard disk-resident database if persistence is important.)  Think of each table as a class and each row as a structure instance or object.  With this in mind, it is trivial to build arrays, queues, stacks, or graphs of objects.

''[escargo] 23 Sep 2005'' - Possible, yes, but not necessarily efficient.  If what I want is a Tclish version
of [Judy arrays] [http://judy.sourceforge.net/], or tries ([http://www.nist.gov/dads/HTML/trie.html],
[http://www.webreference.com/js/tips/000318.html], [http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/]),
or PATRICIA tree [http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/], I lose all the specialized
performance advantages if I have to build them out of the heavy plumbing of an in-memory database.

[DRH] replys:  escargo seems to confuse abstract data structures (ex: arrays, bags, DAGs, queues, etc) with
their implementations (red/black trees, tries, hash tables, linked lists, etc).  The Zen of Tcl is that
you don't worry about the details of implementation, you focus instead on the higher-level aspects of
the problem.  An in-memory SQL database provides O(logN) search and O(N) traversal of data using a
variety of underlying algorithms.  The specific algorithms used (B-tree, hashing, linked lists, etc)
are irrelevant to the programmer using SQL.  All she cares about is that the results come up quickly.
And an SQL database provides the same asymptotic performance as the exotic data structures mentioned
by escargo above.  The exotic data structures may be faster by a constant factor in certain
scenarios.  But generally speaking, that constant is measured in percent.  And the speedup only
applies for narrowly defined problems.  If performance is so critical that I am willing to write
intricate, customized code to achieve a speedup of 20 or 30%, then I'm going to code the problem
in C, not in TCL.  For 95% of the things that I and most other people do, where ease and generality
of implementation and simplicity of maintenance are the driving concerns, then a generalized abstraction
such TCL's built-in arrays or an in-memory SQL database is a far better choice than trying to
employ an exotic data storage and retrieval algorithm.

''[escargo]'' - I can't program with abstract data structures; I need to use their implementations.
(And I do understand the differences between policy and mechanism, too.)
I never said that the data structures would be implemented in Tcl. I fully expect that the implementations
would be in [C] (or C++) with a wrapper that allows Tcl to use them. An in-memory SQL database seems far more
exotic to me than Tcl access to an existing C library for Judy arrays, etc.  I certainly believe that
the KISS principle applies.  That '''constant k''' that differentiates performance between an ideal
implementation of a complex data structure and one layered on an ''in-memory SQL database'' can be significant.

And there are times when I have a program that is in the 5% where such
things matter.  (Also, I'm aware that writing a simpler but naive
implementation and profiling it to see where it's slow is better than
picking a complex algorithm just to speed up part of a program guessing
that you know ahead of time where the bottlenecks will be.)

''[L]'' - I agree wholeheartedly with escargo.  It is nice in many cases to ignore the "details of implementation and focus on the higher-level aspects of the problem".  However, sometimes this is impossible.  Example: I have a 15GB text file that I want to index by creating a trie of all the words in it to record at which offset in the file this word first occurs.  Because I know that the same word will be repeated a million times

''[kanryu] 20080608'' - huddle is a complex container with abstract container algorithms (such as dict and list) which can contain multirank nodes.
----

**Possible solutions for implementing complex data structures**

Ordered roughly after how "exotic" they are, taking into account requirements, user base, and history:

   * Use [array]s (associative arrays allow indices that are strings, rather than mere numbers or [tuple]s of numbers)

   * Use [dict]s (quite similar functionally to arrays)

   * Use tagged lists (see explanation below)

   * Use [namespace]s

   * Use key–value lists (essentially the same as [dict]s, but one avoids using the [dict] command; mostly of interest for backwards compatibility)

   * Use struct::tree (in [tcllib])

   * Use [TclX] [keyed list]s

   * Use one of the [object-oriented] extensions

   * Write a specialized [C] function to implement a new kind of Tcl object

   * Use [SWIG] to create [glue] between some commonly used [C] library and Tcl

   * Use [Records] - [Larry Smith] 6/8/2008 

   * Use [huddle] (or tagged list, dict, or other containers)


----

''Pros and cons of the [object-oriented] extensions:''

When you have a bunch of related data, collect it in an object (e.g. separate fields are separate instance variables).

[Larry Smith] 6/8/2008: [Lost] and [Most] are very compact, likely solutions to this problem since their overhead for object-orientation is minimal.

   * Pro: Natural and intuitive to anybody who has done any OO.

   * Con: Not really a solution by itself (unless all one wants is a C-style struct), but rather a framework within which solutions can be more elegantly implemented.
   * Con: At the moment garbage collection can't be applied to these objects so you need to take care of destruction yourself.
   * Con: Objects are not first-class data items.
   * Con: Requires an external package or at least Tcl 8.6a2 (for [TclOO]).

----

''Pros and cons of [TclX] keyed lists:''

   * Pro: a [keyed list] can handle nested data structures. For example, the keyed list "{ID 106} {NAME {{FIRST Frank} {LAST Zappa}}}" represents the following hierarchical structure:

     ID = 106
     NAME:
        FIRST = Frank
        LAST = Zappa

   * Pro: keyed lists are first-class data items.  They can be passed in and out of procs like any other list or string.

   * Pro: it is possible to query the existence and value of a key in a single command, without the speed drawback of [catch].

   * Con: the access syntax is more verbose than for arrays.
   * Con: access to nested elements requires multiple commands.
   * Con: requires an external package.

----

''Pros and cons of [array]s:''

Separate fields are separate entries in an array (a data structure which has been in the standard Tcl distribution for quite some time).

   * Pro: by one name, you gather all the pieces of information

     set personal_data($name,birthday) "..."
     set personal_data($name,telnum) "..."
     ...

   * Pro: each piece of data resides in a separate variable, so access is easy.

   * Pro: traces can be set on individual elements, which makes it possible to (e.g.) use array elements as [Tk] ''-textvariable''s.

   * Pro: implemented internally by hash tables, hence access is fast.

   * Con: arrays are not "first-class" data in Tcl, you will have to use [upvar] or [array] get and set commands to transfer them between procedures (or provide namespace qualification, but then the arrays cannot be local).

   * Con: essentially only one level of structuring. Otherwise you would have to use structured indices to store the values.

[AIN] (Array In Namespace) is a package for this kind of data structure, providing commands for e.g. copying an array and generating names for arrays.

----

''Pros and cons of [dict]s:''

dicts became available in Tcl 8.5.

   * Pro: in one value, you gather all the pieces of information

     set personal_data [dict create name [dict create FIRST "Frank" LAST "Zappa"] ID 106]
     dict set personal_data telnum $phone
     dict get personal_data name FIRST ; # => Frank
     ...

   * Pro: easy access to all the data pieces.

   * Pro: implemented internally by hash tables, hence access is fast.

   * Pro: dicts are "first-class" data items in Tcl, you can freely pass any dict value to any proc

   * Pro: support for multiple levels of structuring, by putting dicts in dicts.

   * Con: Comparatively new (Tcl 8.5) addition to the language, but also available for 8.4 as a compiled extension [[link, anyone?]], and not too hard to emulate in pure Tcl either -- see [forward-compatible dict]

   * Con: need to use [[[dict]]] subcommands to manipulate contents, unlike with arrays where contents are manipulated using "ordinary" commands.  Sometimes elements have to be unpacked, manipulated, then placed back into the dict.  Example: there is no [[[dict] [lset]]].

   * Con: nested dicts aren't "natively" supported by all [[[dict]]] subcommands, so they sometimes have to be unpacked, manipulated, then placed back into the dict (similar to the previous "con").  Example: [[[dict incr]]] does not allow a key "path" whereas [[[dict get]]] does.

   * Con: the order of elements is not preserved.  This is usually not a problem, but sometimes the order conveys significant information.  In this case an ordinary key-value list can be used, which is suitable for use with [[[foreach] {key value}]], and [[[dict get]]] can be used to extract values by key name.  (Note: there is a performance penalty for doing this.)  If each element of the "ordered" dict is itself a dict, a sort_order key can be added to it, but actually sorting the elements is non-trivial, since [[[lsort] -index]] cannot be used due to the random ordering of the '''nested''' dicts.  Shrug!

----

'''Tagged lists'''

A tagged list is a list where the first element is a "type name" and subsequent entries contain data. Each field has a separate position in the list, but the meaning of a list element may depend on the "type name"; in [C] terms, the data type of a tagged list is a union of structs. [[So, where does one find an implementation of this data structure?]]

''Pros and cons of tagged lists:''

   * Pro: tagged lists {NAME ... {BIRTHDAY ...} {TELNUM ...} ...} are simple to use, they are first-class data (they are simply a form of lists).

   * Pro: tagged lists can have any level of substructure, and are especially suited for tree-like data structures.

   * Pro: the format lends itself to the [data is code] style of processing.

   * Con: replacing values deep inside such a structure can be a nuisance. (But this may force you to use constructor and accessor procs to do this - nice modular design).

   * Con: because of the need for linear or binary searching, tagged lists will be slower than arrays for large data structures.

The comparison on this page is in a sense unfair to tagged lists, since it focuses on table-style data structures. While tagged lists ''can'' be used to implement such things, their primary use-case is rather for tree-like data structures. For an application of this, see [A little XML parser].

The nice thing about this wiki is that it is relatively simple to add more content. If the existing comparison seeems unfair, feel free to add examples that demonstrate the additional functionality. It appears, to this reader, that the page is not intent on being an advocate for one thing over another, but instead providing the means to describe a data structure's benefits and detriments, so that someone can more easily determine when the appropriate times are to make use of them.

----

''Pros and cons of [namespace]s'':

When you have a bunch of related data, collect it as variables in a particular namespace. Separate data sets require separate namespaces.

   * Pro: each field is a separate variable; no special syntax required.

   * Pro: implemented internally with hash tables, for O(1) access.

   * Pro: the data structure can nested by nesting namespaces (children of a node resides in children of the namespace of that node).

   * Con: namespaces are not first-class data; they can only be passed by reference (name of namespace).

   * Con: if you need many namespaces it can be difficult to maintain the application.

   * Con: variables will exist independent of any procedure, so they have a non-local scope. Again, maintenance problems will arise.

   * Con: variables that are in some, but not the current, namespace can be awkward to access.

''What do you mean by namespaces here? I can think of two uses: One simply as a way to group a bunch of variables, and the other with included procs to act more like an object. Perhaps both approaches need entries?''

----

''Pros and cons of key–value lists:''

A '''key–value''' list is a list where every even index item is a "key" (arbitrary identifier string) and the following (odd index) item is the value associated with that key.

   * Pro: a key–value list can handle nested data structures. For example, the keyed list "ID 106 NAME {FIRST Frank LAST Zappa}" represents the following hierarchical structure:

     ID = 106
     NAME:
        FIRST = Frank
        LAST = Zappa

   * Pro: key–value lists are first-class data items.  They can be passed in and out of procs like any other list or string.

   * Pro: key–value lists do not require any Tcl extensions.

   * Pro: The '''[[array get]]''' and '''[[array set]]''' subcommands use key–value lists natively.

   * Pro: Forward compatible with [dict]s (key–value lists have the same string representation as the corresponding dict).

   * Pro: ordering of elements is preserved, which is useful for some applications.

   * Con: before Tcl 8.5, there were '''no''' commands for directly accessing data stored in key–value lists.

The most convenient way of accessing information in a key–value list is to unroll it into an array. (In practice, key–value lists spend a lot of time unrolled into arrays.) Typical code might look like
  proc update_KVL {KVL some more args} {
      array set A $KVL
      # Do the update
      return [array get A]
  }
It is often convenient to use a key–value list for storing "the settings". Consider
  proc do_something {settings_KVL args} {
      # Store defaults into array A
      array set A $settings_KVL; # Override with specific settings
      # Use data in A when doing something
  }

12may03 [jcw] - FWIW, there's a little "ihash" package in [critlib] coded in C (and I have a minimal pure Tcl script for doing the same somewhere), which makes key-value lists (which I used to call interleaved lists) in the same performance order as arrays, because it maintains a hash table alongside.  This is transparent, but you have to be careful to avoid shimmering or things will slow down as the hash table gets rebuilt way too often.  Probably also some more info, search for ihash on the wiki.
----

Of course it is possible to combine these solutions - to get the best (or worst) out of them.

----

[Ken ]Is there other ways in create C-like structure besides TclX keyed lists; for example like the "record" below indicated by Brett. Regarding the solution you given, can you kindly explain how to get it working cause when i tried the code, it failed to recognize the package. And when i tried to use TclX keyed lists, tclsh8.4 fail to recognize the commands. How to get TclX working under Tcl? And is TclX part of Tcl? if yes, why is it separate? Sorry new to this language.
''[Lars H]: [Tclx] is a [package]. You need to do '''package require Tclx''' to load it. But [dict]s are probably a better choice than keyed lists nowadays.''

[Ken ] Another question: How do i create an array of structures? Is it there any way of doing it besides manually naming it directly for example: 
 set ArrayStructure(0.value) 1
 set ArrayStructure(1.value) 2

[Lars H]: Depends on how you implement your "structures". Maybe [C struct is Tcl!] will give you some ideas? But if you're new to Tcl then you're likely to be thinking in too [C]-ish terms about these matters. Perhaps a list of structs would suffice as good? Also take a look at the [struct] package.

----

[RS] 2002-07-05 - In the [Tcl chatroom] we discussed the potential of nested lists for data storage, given the new "multi-dimensional" [lindex] and [lset] starting from 8.4. [Steve Landers] wished for structures so list elements could be addressed with meaningful strings rather than positional indices only. Here's a nice minimal solution that I came up with:
 proc struct {name elements} {interp alias {} $name {} lsearch $elements}

 #... used to define a "struct" (map from names to positions):
 struct staff-> {firstname lastname birthdate phone}

 #... used to abstract such details in coding accesses:
 lset database $employeeID [staff-> phone] 123-4567890
Letting the struct name end in "->" is cute sugar. Note however that after it (and not before it) there must be whitespace.

''Nother way out would be to use "staff:" -[jcw]''

[RS] Yep - see [Implementing enumerated types] for a two-way map that generates both "staff:" and "staff@" for the inverse direction.

[Brett Schwarz] - humm, I just found this page. I actually put a package in tcllib that does what [RS] is talking about (essentially). It is called '[record]', and is located under the 'struct' module in tcllib. It is only in CVS now (as of Jan 2003), but should go into next tcllib release. Here is how you would use it:

 package require struct

 namespace import struct::*
 
 # define records
 record phones {
    home
    work
    cell
 }

 record addr {
   name
   street
   {country USA}
   {record phones phlist}
 }

 # instantiate
 addr inst1
 # assign members
 inst1 configure -name Brett -street main
 # get a values (returns list)
 inst1 cget -name -street
 # you also use the dot notation
 set name [inst1.name]
 inst1.street = Main
 # the phlist is a nested record, and is accesses like this:
 inst1 configure -phlist.home 425-555-1212
 inst1.phlist.work = 206-555-1212
 # following would return a list of the phones (home work cell)
 set phones [inst1.phlist]

Note the 'country' member. I assigned it a default value of "USA". 
Anyways, you get the picture. See the docs in CVS for more info...
----
[George Peter Staplin] I couldn't resist writing an example of doing that with my [megapkg] structure extension.

 load ./structure.so

 proc phone {} {
  set p [structure]
  
  # Initialize the phone record
  $p -home "" -work "" -cell ""

  return $p
 }

 proc addr {} {
  set addr [structure]

  $addr -name "" -street "" -country USA -phoneNumbers [phone]

  return $addr
 }

 set addr [addr]
 # This prevents setting new keys by accident
 lock.structure.creation $addr
 $addr -name "John Doe" -street "Logan"

 [$addr -phoneNumbers] -home 555-1234 -work 555-7812

 puts "The person's name is [$addr -name].  He lives on [$addr -street].  His work number is [[$addr -phoneNumbers] -work]."


The - in a key is just some added sugar.  Any data can be used for a key (even binary data).
----
''[escargo] 30 Aug 2005'' - Are there any extensions or packages for implementing a ''trie'' in [Tcl]?
(I didn't say ''tries'' because that's too easy to hit when searching.) [A hashing Trie in Tcl] 
----
Should [unless] be considered a complex data structure?  What other complex data structures are available? The [tcllib] distribution has graph, sets, matix, pool, prioqueue, queue, skiplist, stack, and tree.
 
----
!!!!!!
%| [Category Data Structure] | [Category Survey] |%
!!!!!!