'''Manipulating sets in Tcl'''

[Arjen Markus] I wrote this short note to show how easy it is to program  basic operations on (finite) sets, operations such as ''union'' and ''intersection'', using nothing but Tcl lists. This note also provides some insight in extending these operations to (countably) ''infinite'' sets.

I will also address a number of design issues that I thought worth writing down. Most of these decisions are arbitrary: the alternatives
are not much worse than the solution that was chosen. 
----

If we talk about ''sets'', we mean an unordered collection of unique items, possibly empty, perhaps infinite in number. Tcl lists are a good way to store finite collections in memory, as they provide almost anything we need. The only drawback is that Tcl lists do not guarantee uniqueness. So we need a method to ensure this uniqueness.

''Note:''

If we simply use variables, then we do not have to worry about cleaning up. Hence, for simplicity's sake we use variables and not procedures or any of the object-oriented approaches to store our sets.

''Note:''
Some of the procedures described can be found in a well-known extension like TclX. In addition, TclX offers several others as well, like [[intersect3]].

----

''Basic operations''
For sets we have the following basic operations:

   * Union, construct a new set out of all (unique) elements of two or more sets
   * Intersection, construct a new set out of all elements that belong to all given sets.
   * Exclusion, construct a new set out of all elements in the first set that are ''not'' contained in the second set.
   * Membership, does a given set have a given item as one of its elements?

Other operations could be:

   * "All", do all elements of a set satisfy a certain condition? 
   * "Any", does at least one of the elements satisfy a given condition? 
   * "Select", construct a subset whose elements consist of those in a given set that satisfy a given condition.
   * "Map", apply some kind of mapping to all elements of a set and construct a new one of the results.

Since we use lists to represent our (finite) sets, a procedure to add a new element to an existing set is simple:

   proc addElement { orglist elem } {
      upvar $orglist list

      if { [lsearch -exact $list $elem] == -1 } {
         lappend list $elem
      }
   }

The union of two sets could be handled by the following procedure (which exploits the uniqueness of all elements within a set):

   proc union { lista listb } {
      set result $lista

      foreach elem $listb {
         if { [lsearch -exact $lista $elem] == -1 } {
            lappend result $elem
         }
      }
      return $result
   }

Intersection is only slightly different:

   proc intersection { lista listb } {
      set result {}

      foreach elem $listb {
         if { [lsearch -exact $lista $elem] != -1 } {
            lappend result $elem
         }
      }
      return $result
   }

And exclusion too:

   proc exclusion { lista listb } {
      set result {}

      foreach elem $lista {
         if { [lsearch -exact $listb $elem] == -1 } {
            lappend result $elem
         }
      }
      return $result
   }

Determining if a set contains some item is easy:

   proc hasElement { lista elem } {
      return [expr [lsearch $lista $elem] >= 0]
   }

The extended operations can be implemented as variations of the following procedure:

   proc all { lista condition } {
      set result 1
      foreach elem $lista {
         set cond [expr $condition]
         if { ! $cond } {
            set result 0
            break
         }
      }
      return $result
   }

''Note:''

With the procedure ''all'' we introduce another design issue. If we want to supply source code that is to be evaluated to a procedure, how do we substitute the variables? One way is to supply the name of a procedure and use a standard interface to this procedure. This is not very flexible: if the condition changes, we need a new procedure, which may be used only once.

Another way is to assume that the variables have some specific name, in this case, we use the name ''elem'' to supply the value of the element in the set to the condition. For simple constructions like the above (only one variable) this is quite flexible. If more variables are involved, then more care is needed.

----

So, if we use these operations in a small sample program, we can illustrate their use:

   set setA  {}
   addElement setA A
   addElement setA B
   addElement setA C

   set setB  {}
   addElement setB "D aa"
   addElement setB B
   addElement setB C
   addElement setB C    ;# Only one occurrence of C will be stored

   puts "Set A: $setA"
   puts "Set B: $setB"

   puts "Union of A and B: [union $setA $setB]"
   puts "Intersection:     [intersection $setA $setB]"
   puts "Set A has element B? [hasElement $setA "B"]
   puts "All elements of set B are short? [all $setB {[string length $elem] < 3}]"

Another task could be to compare the contents of two directories:

   * There are two subtasks, one to see if a file exists in both      directories.
   * The other to see if the contents of these files are equal

A script to implement this task might look like this:

   # Auxiliary procedure: two sets are equal if their union and
   # their intersection have the same length
   #
   proc areSetsEqual { lista listb } {
      set all_elems    [union $lista $listb]
      set common_elems [intersection $lista $listb]

      return [expr [llength $all_elems] == [llength $common_elems]]
   }

   # Get the lists of files (each file name is unique within the
   # directory)

   set files1 [glob [file join $dir1 *]]
   set files2 [glob [file join $dir2 *]]

   if { ! [areSetsEqual $files1 $files2] } {
      puts "Directories contain different files!"
   } else {
      #
      # We need more details - get the CRC checksum for this
      #
      set files_crc_1 [map $files1 {[list $elem [crc32 $elem]]}
      set files_crc_2 [map $files2 {[list $elem [crc32 $elem]]}

      if { ! [areSetsEqual $files_crc_1 $files_crc_2] } {
         puts "One or more files with the same name differ in content!"
      }
   }

Of course, any real-life script would report which files are missing from the directories and which files differ in content, but for the discussion here, these are irrelevant details.

----

''Infinite sets''

Certain types of infinite sets can be handled gracefully as well. The key is to recognise that in practice one is never interested in the set as a whole, but more in finite iteration processes over the elements of the set.

One problem that can be at least approximately solved is the finding the solution of equations like:

   z^3 = x^2 + y^2

where x, y and z are integer numbers. The approximation is that you want to if there are solutions with |x|, |y| and |z| below 1000, say, and, if there are, which combinations of x, y and z.

A perhaps more common example of "infinite" sets or the way they might usefully be handled is that of selections from a database. Commonly, the programming model for dealing with such selections is to have a ''cursor'' or pointer into the selected records and to extract the records one by one in a loop.

As their number is not known in advance (at least in general), this type of processing is quite similar to our previous example: we have a set of items and need to iterate over this set until the set is exhausted, we are satisfied with the result or simply ran out of patience.

''Note:''
Reading from a large ordinary file could fit into this model as well.

Now, iterating over a set of unknown size (meaning: a set that may not exist in memory) requires methods like:

   * Get the first element
   * Get the next one or, more generally, get the successor of a given element.

Any operation, involving only one set is easily expressed in terms of such an iteration:

   * Select "all" elements from the set of natural numbers that is a cube of another natural number:

   proc first { } {
      return 0
   }
   proc next { elem } {
      return [expr $elem+1]
   }
   proc isCube { elem } {
      set cc [expr pow(double($elem),0.33333)]
      set ic [expr int($cc+0.5)]
      return [expr {$ic*$ic*$ic==$elem}]
   }

   set satisfied 0
   set elem      [first]
   while { ! $satisfied } {
      if { [isCube $elem] } {
         puts $elem
      }
      set elem      [next $elem]
      set satisfied [expr {$elem>1000}]
   }

Operations like union and intersection that involve two or more sets are much more intricate. You must ensure that the elements are unique and that none are lost.

----

''I have not yet finished this intriguing topic'' It will require more than a handful of lines to explain this. Maybe a complete Tcl-only extension.