Version 4 of SetOps, Symmetrical difference

Updated 2011-05-07 09:49:09 by dkf

Purpose: Definition and discussion of set functionality, especially the function to compute the symmetrical difference of two sets. Symmetrical difference being the set containing those elements which are in one of the sets, but not the other, the inverse of intersection.

Back to the Chart of proposed set functionality.


::setops::symdiff set1 set2


Arguments:

  • Exactly two arguments
  • Each argument represents one the sets to intersect.
  • The sets are given by value, as tcl-lists.

Result:

  • A list containing (set1 - set2) + (set2 - set1).

Implementation:

Several variants are possible:

  1. Like the first variant in SetOps, Intersection, but takes the result from a different branch. SetOps, Difference can use this technique too. Predicted performance O(n+2nlogn), n = length of longest set.
  2. Transform both lists into an array, then scan them and check for existence of the chosen element in array 1 <> existence in array 2. Predicted performance: O(4n), n above.
  3. Transform both lists into an array, then scan each and add all elements to the result array which is not in the array of the other list. Predicted performance: O(4n), n above.
  4. Use fastest union, difference and intersection algorithms to compute (a+b) -(a*b). Predicted performance: O(2n+2n+2n)=O(6n).
  5. Create an array with all the set elements as keys where the length of the value in the array is the number of instances of each element in the input sets. The output set is then those keys where the value in the array is of length 1. Predicted performance O(2m+2n) where m and ''n' are the lengths of the input lists. DKF

Possible shortcuts:

  • First argument is empty => Result is the second argument.
  • Second argument is empty => Result is the first argument.

Timing information:

Summary:

  • In the best case 1 is consistently faster than everything else. Unfortunately its worst-case behaviour is much slower than everything else too.
  • From fast to slow the order of variants is: 4, 3, 2, 1.
  • Before the invention of using the local namespace as implicit array, and the accompanying performance boost for "diff", "union" and "intersect" the simple implementation (4) was 2nd-best after 3.
  • Performance of 5 and 6 still to be measured...

Variant 1.

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            return $b
        }
        if {[llength $b] == 0} {
            return $a
        }

        set res {}

        set a [lsort $a]
        set b [lsort $b]

        while {1} {
            # Store lindex/0,1 in var, access later faster ?
            set n [string compare [lindex $a 0] [lindex $b 0]]
            if {$n == 0} {
                # A = B => element in both, so not in sym. difference
                set a [lrange $a 1 end]
                set b [lrange $b 1 end]
            } elseif {$n > 0} {
                # A > B, remove B, we are beyond the element. 
                # This element in B is part of the result too.
                lappend res [lindex $b 0]
                set b [lrange $b 1 end]
            } else {
                # A < B, remove A, we are beyond the element.
                # This element in A is part of the result too.
                lappend res [lindex $a 0]
                set a [lrange $a 1 end]
            }
            if {[llength $a] == 0} {
                foreach e $b {
                    lappend res $e
                }

                return $res
            }
            if {[llength $b] == 0} {
                foreach e $a {
                    lappend res $e
                }

                return $res
            }
        }

        return $res
    }

Variant 2.

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            return $b
        }
        if {[llength $b] == 0} {
            return $a
        }

        set res {}

        foreach e $a {
            set aa($e) .
        }

        foreach e $b {
            set ba($e) .
        }

        foreach e $a {
            if {[info exists aa($e)] != [info exists ba($e)]} {
                lappend res $e
            }
        }

        foreach e $b {
            if {[info exists aa($e)] != [info exists ba($e)]} {
                lappend res $e
            }
        }

        return $res
    }

Variant 3.

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            return $b
        }
        if {[llength $b] == 0} {
            return $a
        }

        set res {}

        foreach e $a {
            set aa($e) .
        }

        foreach e $b {
            set ba($e) .
        }

        foreach e $a {
            if {![info exists ba($e)]} {
                lappend res $e
            }
        }

        foreach e $b {
            if {![info exists aa($e)]} {
                lappend res $e
            }
        }

        return $res
    }

Variant 4.

    proc ::setops::symdiff {a b} {
        diff [union $a $b] [Intersect2 $a $b]
    }

Variant 5. - DKF

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            set ret $b
        } elseif {[llength $b] == 0} {
            set ret $a
        } else {
            foreach key $a {set ary($key) "1"};# Assume input is set!
            foreach key $b {append ary($key) "2"}
            set ret {}
            foreach {key val} [array get ary] {
                if {[string length $val] == 1} {
                    lappend ret $key
                }
            }
        }
        set ret
    }

Variant 6. - 5a (Numbers instead of strings)

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            set ret $b
        } elseif {[llength $b] == 0} {
            set ret $a
        } else {
            foreach key $a {incr ary($key)} ; # Error, ary($key)
            foreach key $b {incr ary($key)} ; # possibly undefined
            set ret {}
            foreach {key val} [array get ary] {
                if {$val == 1} {
                    lappend ret $key
                }
            }
        }
        set ret
    }

DKF - But this doesn't work, since incr barfs on an undefined variable. Pity...

AK - Found that out while testing too. Seems that I got bitten from a perlie (Perl initializes the var to 0 and then performs the ++). I'll leave it in as an example of a bug.

DKF: Of course, this will actually work on 8.5 as the semantics of incr got changed.


Variant 7 - 5b (uses ideas from this page)

    proc ::setops::symdiff {a b} {
        if {[llength $a] == 0} {
            set ret $b
        } elseif {[llength $b] == 0} {
            set ret $a
        } else {
            foreach key $a {set ary($key) ""}
            set ret {}  
            foreach key $b {
                if {[info exist ary($key)]} {
                    unset ary($key)
                } {
                    lappend ret $key
                }
            }
            foreach key [array names ary] {lappend ret $key}
            # AK ?speed: eval lappend ret [array names ary]
        }
        set ret
    }

This variant seems to be nearly as fast as 1 in my tests (unlike 5), but has the good property of never performing a sort, and thus never having any part which is O(n log n). Calculating the actual complexity of this variant is not easy, but it is certainly bounded in the worst case by O(3n) (assuming that the input lists are the same length but have no common elements and are not sorted.) - DKF


-- AK