Manipulating sets in Tcl

Manipulating sets in Tcl

Arjen Markus I wrote this short note to show how easy it is to program basic operations on (finite) sets, operations such as union and intersection, using nothing but Tcl lists. This note also provides some insight in extending these operations to (countably) infinite sets.

I will also address a number of design issues that I thought worth writing down. Most of these decisions are arbitrary: the alternatives are not much worse than the solution that was chosen.

If we talk about sets, we mean an unordered collection of unique items, possibly empty, perhaps infinite in number. Tcl lists are a good way to store finite collections in memory, as they provide almost anything we need. The only drawback is that Tcl lists do not guarantee uniqueness. So we need a method to ensure this uniqueness.

LV: Why list instead of array? With an array, you get the uniqueness, and you may get some performance benefit (with a sacrifice of some memory usage).

Arjen Markus: passing arrays to procedures is a bit more work than passing lists, but it could be a nice alternative. As with many such decisions: there is not a truly best choice.

On second thought, using lists instead of arrays enables a much simpler implementation of the infinite sets - the problem there is to guarantee that, say, the union of two sets still exists after any of the sets has disappeared. It is done in the sample implementation by incorporating the definitions of the sets as sublists (something that would be awkward with arrays).

Note:

If we simply use variables, then we do not have to worry about cleaning up. Hence, for simplicity's sake we use variables and not procedures or any of the object-oriented approaches to store our sets.

Note: Some of the procedures described can be found in a well-known extension like TclX. In addition, TclX offers several others as well, like [intersect3].

Basic operations For sets we have the following basic operations:

Union, construct a new set out of all (unique) elements of two or more sets
Intersection, construct a new set out of all elements that belong to all given sets.
Exclusion, construct a new set out of all elements in the first set that are not contained in the second set.
Membership, does a given set have a given item as one of its elements?

Other operations could be:

"All", do all elements of a set satisfy a certain condition?
"Any", does at least one of the elements satisfy a given condition?
"Select", construct a subset whose elements consist of those in a given set that satisfy a given condition.
"Map", apply some kind of mapping to all elements of a set and construct a new one of the results.

Since we use lists to represent our (finite) sets, a procedure to add a new element to an existing set is simple:

   proc addElement { orglist elem } {
      upvar $orglist list

      if { [lsearch -exact $list $elem] == -1 } {
         lappend list $elem
      }
   }

The union of two sets could be handled by the following procedure (which exploits the uniqueness of all elements within a set):

   proc union { lista listb } {
      set result $lista

      foreach elem $listb {
         if { [lsearch -exact $lista $elem] == -1 } {
            lappend result $elem
         }
      }
      return $result
   }

Intersection is only slightly different:

   proc intersection { lista listb } {
      set result {}

      foreach elem $listb {
         if { [lsearch -exact $lista $elem] != -1 } {
            lappend result $elem
         }
      }
      return $result
   }

And exclusion too:

   proc exclusion { lista listb } {
      set result {}

      foreach elem $lista {
         if { [lsearch -exact $listb $elem] == -1 } {
            lappend result $elem
         }
      }
      return $result
   }

Determining if a set contains some item is easy:

   proc hasElement { lista elem } {
      return [expr [lsearch $lista $elem] >= 0]
   }

The extended operations can be implemented as variations of the following procedure:

   proc all { lista condition } {
      set result 1
      foreach elem $lista {
         set cond [expr $condition]
         if { ! $cond } {
            set result 0
            break
         }
      }
      return $result
   }

Note:

With the procedure all we introduce another design issue. If we want to supply source code that is to be evaluated to a procedure, how do we substitute the variables? One way is to supply the name of a procedure and use a standard interface to this procedure. This is not very flexible: if the condition changes, we need a new procedure, which may be used only once.

Another way is to assume that the variables have some specific name, in this case, we use the name elem to supply the value of the element in the set to the condition. For simple constructions like the above (only one variable) this is quite flexible. If more variables are involved, then more care is needed.

So, if we use these operations in a small sample program, we can illustrate their use:

   set setA  {}
   addElement setA A
   addElement setA B
   addElement setA C

   set setB  {}
   addElement setB "D aa"
   addElement setB B
   addElement setB C
   addElement setB C    ;# Only one occurrence of C will be stored

   puts "Set A: $setA"
   puts "Set B: $setB"

   puts "Union of A and B: [union $setA $setB]"
   puts "Intersection:     [intersection $setA $setB]"
   puts "Set A has element B? [hasElement $setA "B"]
   puts "All elements of set B are short? [all $setB {[string length $elem] < 3}]"

Another task could be to compare the contents of two directories:

There are two subtasks, one to see if a file exists in both directories.
The other to see if the contents of these files are equal

A script to implement this task might look like this:

   # Auxiliary procedure: two sets are equal if their union and
   # their intersection have the same length
   #
   proc areSetsEqual { lista listb } {
      set all_elems    [union $lista $listb]
      set common_elems [intersection $lista $listb]

      return [expr [llength $all_elems] == [llength $common_elems]]
   }

   # Get the lists of files (each file name is unique within the
   # directory)

   set files1 [glob [file join $dir1 *]]
   set files2 [glob [file join $dir2 *]]

   if { ! [areSetsEqual $files1 $files2] } {
      puts "Directories contain different files!"
   } else {
      #
      # We need more details - get the CRC checksum for this
      #
      set files_crc_1 [map $files1 {[list $elem [crc32 $elem]]}
      set files_crc_2 [map $files2 {[list $elem [crc32 $elem]]}

      if { ! [areSetsEqual $files_crc_1 $files_crc_2] } {
         puts "One or more files with the same name differ in content!"
      }
   }

Of course, any real-life script would report which files are missing from the directories and which files differ in content, but for the discussion here, these are irrelevant details.

Infinite sets

Certain types of infinite sets can be handled gracefully as well. The key is to recognise that in practice one is never interested in the set as a whole, but more in finite iteration processes over the elements of the set.

One problem that can be at least approximately solved is the finding the solution of equations like:

   z^3 = x^2 + y^2

where x, y and z are integer numbers. The approximation is that you want to if there are solutions with |x|, |y| and |z| below 1000, say, and, if there are, which combinations of x, y and z.

A perhaps more common example of "infinite" sets or the way they might usefully be handled is that of selections from a database. Commonly, the programming model for dealing with such selections is to have a cursor or pointer into the selected records and to extract the records one by one in a loop.

As their number is not known in advance (at least in general), this type of processing is quite similar to our previous example: we have a set of items and need to iterate over this set until the set is exhausted, we are satisfied with the result or simply ran out of patience.

Note: Reading from a large ordinary file could fit into this model as well.

Now, iterating over a set of unknown size (meaning: a set that may not exist in memory) requires methods like:

Get the first element
Get the next one or, more generally, get the successor of a given element.

Any operation, involving only one set is easily expressed in terms of such an iteration:

Select "all" elements from the set of natural numbers that is a cube of another natural number:

   proc first { } {
      return 0
   }
   proc next { elem } {
      return [expr $elem+1]
   }
   proc isCube { elem } {
      set cc [expr pow(double($elem),0.33333)]
      set ic [expr int($cc+0.5)]
      return [expr {$ic*$ic*$ic==$elem}]
   }

   set satisfied 0
   set elem      [first]
   while { ! $satisfied } {
      if { [isCube $elem] } {
         puts $elem
      }
      set elem      [next $elem]
      set satisfied [expr {$elem>1000}]
   }

Operations like union and intersection that involve two or more sets are much more intricate. You must ensure that the elements are unique and that none are lost.

If we want to have the same functionality with infinite sets as we have with finite setsm then we face a number of additional issues:

As we can not put the elements in a kist as such, we must generate them one at a time. This can be done with the first and next methods shown above, though the actual generation must be part of the sets, not of the procedures.
We will need to implement union and intersection operations in such a way that they define a new set that behaves just any ordinary set. However, first enumerating the first and then the second is impossible. First of all we would never see elements of the second set and we would not be able to ensure that common elements appear only once.

For instance: create the union of twofolds and five-folds. Picking alternatively the elements of the first and then the second set would produce: 0, 0, 2, 5, 4, 10, 6, 15, 8, 20, 10, 25, ... A lot of double elements appear!

We need a different approach. One such approach is to get the "next" element of both and decide which one is closest to the given element. This requires two things; the element given as the argument to the next operation need not be an element of the set, but just an indicator of which element we want and there needs to be a "sense" of distance that is the same for the two sets. With these conditions fulfilled, we can implement union as:

   proc union { seta setb } {
      ... Construct the "union" set of the two
      return unionab
   }

   proc first { setany } {
      ... Other cases
      if { [isUnion $setany] } {
         set seta   [getUnionSet $setany 0]
         set setb   [getUnionSet $setany 1]
         set firsta [first $seta]
         set firstb [first $setb]
         if { [distance $seta $firsta] <= [distance $setb $firstb] } {
            return $firsta
         } else {
            return $firstb
         }
      }
   }
   proc next { setany elem } {
      ... Other cases
      if { [isUnion $setany] } {
         set seta   [getUnionSet $setany 0]
         set setb   [getUnionSet $setany 1]
         set nexta [next $seta $elem]
         set nextb [next $setb $elem]
         if { [distance $seta $nexta] <= [distance $setb $nextb] } {
            return $nexta
         } else {
            return $nextb
         }
      }
   }

Similarly, all other operations can be defined.

Because the operations define new sets that behave identical to the original ones, we can combine an arbitrary number of sets:

   set grand_unified_set [union seta [union setb [union setc setd]]]

(union being an associative and commutative operation, the order is irrelevant)

Note: To maintain the convenience of variables that hold the definitions of the sets, we have augmented the interface of the first and next procedures. These now get called recursively. There remain two issues, bad performance and cartesian products. Selections, as well as intersections and exclusions may be a problem, because they may cause either infinite loops or very long running loops. Consider the set of natural numbers that fulfill the condition {$elem<1000}. By inspection of this simple we see immediately that this set is finite. But would it be illegal to ask for the successor of 999 in that set? No, because in general we do not know in advance that the set is actually finite. To illustrate this with a less trivial example: consider the set of Mersenne primes. Only a small number is known, the largest having at least several hundreds of digits. It is not known whether there is a finite number of them.

The same holds for intersections like the set of even numbers with the set of primes - there is only one number in this intersection - or an exclusion (the set of prime numbers except all the odd numbers). This poses upon us the need to limit the search and to communicate the failure (not being able to come up with the next element) to the user.

Note: This is less trivial than it may seem, as an empty element {} might be a valid result! The most elegant way out of this is to provide a method that indicates the success or failure. One could generate an exception or an error, but then this can hardly be called an exception in the true sense of the word: it is simply a real possibility!

The cartesian product of two (or more) sets is the set of all pairs (tuples) of elements from both sets:

   {1,2,3} X {4,5} = { {1,4}, {1,5}, {2,4}, {2,5}, {3,4}, {3,5}}

The cartesian product can be used to generate a sequence of points that cover a two-dimensional (integer) space or that enumerate the set of rational numbers. It simply requires a trick to do that:

           |
           |
       2  (3)
           |
       1  (1)  (4)
           |
       0  (0)--(2)--(5)---

           0    1    2

You can list the points on the integer grid by the following procedure:

The first point is (0,0)
Then move to the line made of points (x,y) where x+y = 1 (x,y > 0) This line contains the points (1,0) and (0,1)
Then the next line: x+y = 2 (x,y > 0) This line contains the points (2,0), (1,1) and (0,2)
And so on

This method can be applied to any set, if you recognise that the integer coordinates represent the nth successor of the first element. There is also no problem to use this method recursively, so that three or more infinite sets are combined and you can iterate over a multidimensional space.

Just an illustration:

The set of integer numbers, Z, can be generated like this:

        first = 0
        proc next { elem } {
           if { $elem > 0 } {
              return [expr {-$elem}]
           } else {
              return [expr {1-$elem}]
           }
        }

     This gives the sequence: 0, 1, -1, 2, -2, ...

     The corresponding distance function is:

         expr { $elem>=0? 2*$elem : 1-2*$elem }

Two-dimensional integer space, ZxZ, can be generated with the above method, giving the sequence: (0,0), (1,0), (0,1), (-1,0), (1,1), (0,-1), (2,0), (-1,1), (1,-1), (0,2), ...

I have a (somewhat limited) implementation ready of the above ideas. As it is rather lengthy, I gave it its own page: Manipulating infinite sets in Tcl.

PN 2007.9.28 I have always wanted to have a go at countably infinite sets and put together this countably infinite sets. It is rather simple and rather quirky.

Category Discussion