Version 0 of A generic collection traversal interface

Updated 2006-01-14 21:12:27

NEM 14 Jan 2006 What's the best way to iterate over the elements of a collection? There are a number of different ways of doing this. The simplest is to provide some way to convert your collection to a list, e.g. by having a string representation which coincides with that of Tcl's lists, or by providing an explicit toList operation. However, while this is a simple approach, it is not the most efficient, particularly when the collection being iterated over is complex or expensive to access (e.g. a database). There are two main alternative methods of performing iteration in an effective manner:

  1. Using iterators/cursors, which are stateful pointers to a 'current' element in a collection;
  2. Using higher-order enumerator functions, such as foreach, map, filter etc.

The article "Towards the best collection traversal interface" by Oleg Kiselyov [L1 ] discusses these alternatives and makes a powerful case for the latter option: that a higher-order enumerator is the best default choice for a collecion interface. In particular, the article recommends using a left-fold operation with early termination as the enumerator function. There are a number of reasons for this decision:

  • Firstly, the actual traversal code is written by the author of the collection, and so can make use of internal knowledge of the collection for efficiency, without exposing any of these details to clients of the collection;
  • Related to the above point, the enumeration code completely encapsulates the traversal, and so can take care of allocating and releasing any needed resources properly, even in the presence of exceptions;
  • Writing an enumeration function is generally easier than writing a stateful iterator (e.g., consider a traversal for a complex tree structure such as an AVL tree);
  • As iterators are usually stateful, they introduce an implicit dependency among all expressions that use the iterator, whereas an enumeration function hides its state from clients;
  • Finally, iterators need some out-of-band method of indicating that the enf of the collection has been reached (e.g. an isEmpty method, or a special collection empty return code or exception), whereas an enumeration function simply returns when it reaches the end of the collection.

These are all compelling reasons for enumeration functions, and this is the approach that Scheme took to collections in SRFI 44: Collections [L2 ]. However, it should be noted that there are some drawbacks to using enumerator functions. While enumerator functions don't suffer from a problem of how to indicate end-of-collection, they suffer from the complimentary problem of how to indicate early termination of the enumeration: with an iterator, you simply stop using the iterator, but with an enumerator you have to signal to the enumerator function to stop the traversal. The article referenced above proposes using a boolean result indicator at each iteration which says whether to continue or not. In Tcl, however, we already have the break and continue exceptions for exactly this purpose, so the problem is effectively solved without introducing new machinery. The other, more serious, problem of enumeration functions is that it is difficult to enumerate over more than one collection at once in a general way. foreach provides an elegant solution to this, which can probably be generalised for the interface I propose below. However, it is also possible to define a routine that converts any collection enumerator function of the form defined into an iterator. I describe how to do this after presenting the proposal for a general collection traversal interface.

Proposal

The proposal is to use a generalised fold operation as the default enumeration mechanism for any collections defined. A collection can be anything that logically contains some elements, including: data-structures such as the tcllib graph and tree, OO container classes, database query interfaces, XML documents, general parsers for data-files (where the elements of the container are elements of the syntax, e.g. BibTeX records), and many others. This is deliberately a very general definition, meant to apply to a wide variety of different situations. By defining a standard interface for traversals, we can then define lots of utility functions in terms of this interface which will work efficiently for many different collection types.

The interface proposed is:

 $collection enumerate ?options..? proc seed

where 'proc' is a procedure to be called for each element of the collection, and the 'seed' argument is an initial value for a parameter of that proc. The 'proc' argument should be of the form:

 proc myproc {seed item} { ...; return $newseed }

The syntax of the interface resembles that of object oriented extensions to Tcl such as incr Tcl or XOTcl. However, it should also be possible to implement the interface using other means. For instance, many extensions written in C provide a similar interface (e.g. tdom), but even for pure-Tcl extensions it is not too difficult to author. To make it easier to support pure-Tcl extensions, clients of the interface should assume that $collection is a command-prefix (i.e. list) rather than a single command, and should be expanded before being called:

 # In Tcl 8.5:
 set result [{expand}$collection enumerate ...]
 # In <8.5
 set result [eval $collection enumerate ...]
 # Or (safer)
 proc invoke {cmd args} { uplevel 1 $cmd $args }
 set result [invoke $collection enumerate ...]

In this way normal ensemble commands can be supported as collection interfaces with little extra machinery.

Anyway, let's put syntax to one side and look now and the proposed semantics. In its basic form, this operation works just like a left-fold operation. The haskell definition of foldl is:

 foldl proc seed []         = seed
 foldl proc seed (x:xs)     = foldl proc (proc seed x) xs
 -- e.g. foldl (+) 0 [1,2,3] --> (((0+1)+2)+3) = 6

It runs through each element of the collection from left-to-right (with suitable definitions of 'left' and 'right' for the collection type) and calls the 'proc' procedure with the current item and the seed value. The procedure then performs any calculations and returns a new seed value for the next iteration -- in this way, the seed value becomes accumulated state (e.g. a running total) that can be calculated as we traverse the collection, without relying on external variables. This fold operation is extremely general and powerful. For more information on folds see this tutorial: [L3 ]. In addition to returning the seed values, the iteration procedure can also return a Tcl exception code to indicate early termination of the traversal. These exceptions work with the usual semantics that they have inside loop bodies:

  • TCL_OK: normal, proceed with next iteration;
  • TCL_CONTINUE: skip to next iteration with same seed values as last time;
  • TCL_BREAK: terminate iteration and return current seed values;
  • other: abort iteration and propagate the exception.

To demonstrate, here is an implementation of the interface for text files (using a TOOT-like implementation):

 proc invoke {cmd args} { uplevel #0 $cmd $args }
 proc exceptiontype {code} {
     set token [lindex {ok error return break continue} $code]
     return [expr {$token eq "" ? $code : $token}]
 }
 proc file_enumerate {file proc seed} {
     set fd [open $file]
     foreach line [split [read $fd] \n] {
         set rc [catch {invoke $proc $seed $line} result]
         switch -exact [exceptiontype $rc] {
             ok         { set seed $result }
             break      { break }
             continue   { continue }
             default    {
                 # Cleanup and propagate (needs improving)
                 close $fd
                 return -code $rc -errorinfo $::errorInfo $result
             }
         }
     }
     close $fd
     return $seed
 }
 proc file: {file method args} {
     if {$method eq "enumerate"} {
         uplevel 1 [linsert $args 0 file_enumerate $file]
     } else {
         uplevel 1 [linsert $args 0 file $method $file]
     }
 }

Having this interface in place allows us to write code like the following:

 proc def {name = args} {
     uplevel 1 [linsert $args 0 interp alias {} $name {}]
 }
 # Demo
 def myfile = file: test.txt
 proc count {total _} { incr total }
 puts "[myfile nativename] contains [myfile enumerate count 0] lines"
 proc longest {longest line} {
     if {[string length $line] > [string length $longest]} {
         return $line
     } else {
         return $longest
     }
 }
 puts "The longest line is [myfile enumerate longest {}]"

Some benefits of this code become apparent. Firstly, we don't have to worry about opening and closing the file, as that is taken care of for us. We also don't have to worry about cleaning up if an exception occurs in the middle of processing as that is also taken care of for us by the file_enumerate implementation. Now, at present this implementation isn't particularly impressive in terms of performance, although it is adequate. But imagine how fast it would be if file_enumerate were coded in C. We could do all sorts of tricks, like memory-mapping the file and using optimised code to traverse the lines, avoiding line-ending conversions etc, and just calling back into a byte-compiled Tcl proc as needed. Could we then approximate the speed of wc -l in Counting a million lines? Could be worth a try.

Defining generic operations

Once we have our very general enumerate procedure, we can also then define other operations in terms of it. For instance, here are the classic map and filter generalised to arbitrary collections:

 proc map {proc collection} {
     invoke $collection enumerate [list map-helper $proc] [list]
 }
 proc map-helper {proc accum item} {
     lappend accum [invoke $proc $item]
 }
 proc filter {proc collection} {
     invoke $collection enumerate [list filter-helper $proc] [list]
 }
 proc filter-helper {proc accum item} {
     if {[invoke $proc $item]} {
         lappend accum $item
     } 
     return $accum
 }

We can test these on our file enumeration example:

 puts "The sizes of each line in [myfile nativename] are:"
 puts [join [map {string length} myfile] \n]
 proc linebigger {size line} { expr {[string length $line]>$size} }
 puts "The lines bigger than 80 chars:"
 puts [join [filter {linebigger 80} myfile] \n]

As you can see, nothing in the definition of map or filter depends on any knowledge of the implementation of file_enumerate, or even on the type of the underlying collection: we could write 'enumerate' methods for lists, dictionaries, database queries and others and the generic definitions of map and filter will work with all of them without alteration.

Options

The proposed interface also allows for some options to be given to the enumerate method. The full list of options and values accepted could vary from collection to collection, but an initial list of standard options that might be supported is presented here:

  • -direction left|right: Specifies a traversal direction, allowing for an optional right-fold to be implemented as well. Could also be extended to allow specifying specific traversal orderings like pre- or post- or in-order traversal of graphs/trees.
  • -completioncommand proc: Specifies a callback to call with the final seed value when the collection has been traversed. Presence of this flag enables an asyncronous operation using the event loop. The enumerate method returns immediately and carries on processing in the background using after or some other means (e.g. fileevents) to schedule calls to the iteration procedure. Could return some sort of token that can be used to cancel the traversal.
  • -progresscommand proc: A separate progress command that is called at regular intervals to indicate how far through the collection the traversal is. This could be used for updating a progress-bar during a long traversal, for instance.
  • -errorcommand proc: Callback to handle when the iteration command returns an error (default would be to just return the error again). This would allow for flexible error handling of e.g. parse errors during a traversal.

There are probably other options that can be thought of. Conventions for common options can evolve over time.

Converting an enumeration into an iterator

As I mentioned earlier, one of the drawbacks of an enumeration method compared to an iterator is that there is no easier way to combine enumeration of two separate collections. This is essentially because the enumeration method moves the loop (the traversal code) out of the application code and into the collection code: if you have two collections, then which one gets to be in charge of the enumeration? And how does it access elements of the other collection? foreach manages to do this, but only because it knows that each collection it enumerates is a list and so it knows how to iterate through all of them. A general enumeration scheme cannot make such a guarantee, so it would seem that we are stuck. In the cases where you want to iterate through multiple collections at once, it seems that an iterator-style interface is the only sensible option. Thankfully, there is a general way to convert any enumerator style interface of the sort proposed into an iterator style interface, which will now be demonstrated. In a language with continuations this conversion can be accomplished relatively easily, however Tcl does not have such a feature. Oleg Kiselyov has another article [L4 ] describing how to do the conversion in a language without continuations, in his case Haskell. I will now show a conversion of that code into Tcl.

The details of the conversion are not entirely straight-forward. In order to allow our traversal to be turned inside-out, we have to introduce a new parameter to the enumerate interface which will be used to make recursive calls to enumerate for each iteration. To avoid changing the interface presented above, this new parameter could be introduced as an option -self proc. Unfortunately, the conversion also requires rewriting our enumeration function in a rather inefficient way: using recursion instead of iteration. In Tcl, this makes the resulting code perform quite poorly, so it may be wise for collection authors to hand-code a custom iterator interface rather than relying on this conversion, at least for now. Perhaps if Tcl ever adds continuations, or some way can be found to reduce the overhead of this conversion then it would become more practical. Still, it is worth seeing what the conversion looks like.

I will demonstrate the conversion using an enumeration interface for lists, as they are perhaps the simplest collection. Our iterator interface will actually be a stream, which is the nicest way of exposing an iterator as it also manages to encapsulate the state of the traversal. First, then, we will define our stream data-type (i.e., a lazy list):

 namespace eval stream {
    proc delay  args        { return $args }
    proc force  item        { uplevel #0 $item }
    proc nil    {}          { delay error "empty list" }
    proc cons   {head tail} { delay list $head $tail }
    proc head   stream      { lindex [force $stream] 0 }
    proc tail   stream      { force [lindex [force $stream] 1] }
    proc take   {n stream}  {
        if {$n == 0} { return [list] }
        linsert [take [incr n -1] [tail $stream]] 0 \
                [head $stream]
    }
    namespace export {[a-z]*}
    namespace ensemble create ;# 8.5ism!
 }

Here, streams are represented as pairs of unevaluated expressions: the first element when evaluated generates a value, and the second element generates a new stream (the 'tail' of the stream). This is basically a lazy version of Lisp's cons-cells. Now we define our enumeration method for lists. However, as we are adding an extra 'self' parameter, we will call this method "superfold" rather than enumerate.

 proc lsuperfold {self list proc seed} {
     if {[llength $list] == 0} {
         return $seed
     } else {
         set tail [lassign $list head]
         invoke $self $tail $proc [invoke $proc $seed $head]
     }
 }
 proc list: {list method args} {
     uplevel 1 [linsert $args 0 l$method $list]
 }

Now, from this we can derive our original enumerate method without the extra 'self' parameter by simply passing it itself (i.e., we find its fix-point, as you would using the Y-combinator: Hot Curry):

 proc lenumerate {list proc seed} {
     lsuperfold lenumerate $list $proc $seed
 }

And, just to demonstrate that this conforms to our original definition, we can use the generic map and filter functions on it as usual:

 def l = list: {1 2 3 4 5 6 7 8 9 10}
 proc + {a b} { expr {$a + $b} }
 puts "add1 to all = [map {+ 1} l]"

But adding the explicit 'self' parameter also allows us interrupt a traversal in the middle and capture its state (essentially we have made the continuation explicit, a technique known as CPS - continuation-passing style). Here, we will capture the enumeration continuation and store it in a stream:

 proc fold2stream {fold collection} {
     invoke $fold [list streamnext $fold] \
            $collection snd [stream nil]
 }
 proc snd {a b} { return $b }
 proc streamnext {fold tail proc seed} {
     stream cons $seed [stream delay fold2stream $fold $tail]
 }

We can now use this to convert our list into a stream:

 def lstream = fold2stream lsuperfold
 set l [list 1 2 3 4 5 6 7 8 9 10]
 set s [lstream $l]
 puts "First 4 = [stream take 4 $s]"

Discussion

NEM: The original article about the Scheme version generalises the enumerate method further by allowing it to take and return multiple seeds, i.e.:

 $collection enumerate $proc seed seed...

I initially implemented Tcl versions that way, but in practice I found it was clumsy to use for the common case where you only have 1 seed. For instance, having to wrap return values with: return list $value was something it is too easy to overlook. One seed should be sufficient as you can always pack up arbitrary data-structures into it. The only advantage of the multi-seed version is that it takes care of some of the unpacking and checking of these values for you (as they are just normal proc arguments).

Any comments are welcome. I'm a big fan of using folds in this way, as I think it works well with scripting by moving loops into C code (assuming collections implemented in C). The only real problem is that of looping over two collections at once. The automatic conversion works well in Scheme (where call/cc handles most of the details), but in Tcl it could do with some work. If anyone has any good ideas about how to improve it, please share.

At the moment, this proposal is just a suggestion. When it has matured a bit, I wonder if it might be worth writing an Informational TIP or similar to propose some standard interfaces for collections? Something similar to the Scheme SRFI referenced above.


[ Category Concept ]