SetOps, Intersection

Purpose: Definition and discussion of set functionality, especially the function to compute the intersection of two or more sets.

Back to the Chart of proposed set functionality.

: ::setops::intersect set1 set2...

Arguments:

Any number of arguments (including none).
Each argument represents one the sets to intersect.
The sets are given by value, as tcl-lists.

Result:

A list containing the intersection of the argument sets.

Implementation:

I assume that intersection of multiple sets is finally reduced to intersection of two sets. Several variants to implement this are possible:

Sort both lists, then scan them and compare their first elements. Take an element if they are both equal, else remove the smaller element from its list. Stop if any of the lists is/gets empty. Predicted performance: O(n+2nlogn), n = length of longest list/set. Variant A.
Convert both lists into arrays, then scan them and check whether an element is in both arrays. Predicted performance: O(4n). This is the naive implementation for using hashtables. Variant B.
Convert the longer list into an array, then scan the shorter and place all elements into the result which are in the array too. Predicted performance: O(2n). The optimal implementation with hashtables.
As in 3., but use the local namespace as implicit array. Predicted performance: O(2n), n as above. Variant D.

Possible shortcuts:

No arguments => Result is the empty set.
One argument => Result is that argument.
At least one argument is empty => Result is the empty set.
An intermediate intersection is empty => Result is the empty set.
Start with the smallest sets, as that will increase the propability of getting an empty intermediate intersection. Penalty: O(mlogm) for sorting the list of sets, m = number of sets.

Timing

Summary:

D outperformed the other variants, for all inputs. The behaviour for A*A is the same as for B, i.e. worst-case, see below, but still better than the other variants. The lower constant definitely comes through.
For non-intersecting random input B is better than A, in the long run.
Compared against the case above, the computation of the intersection of a set with itself is much faster in A, and much slower in B. I would say that this is best-case for A, and worst-case for B.

Variant A.

proc ::setops::Intersect2 {a b} {
    if {[llength $a] == 0} {
        return {}
    }
    if {[llength $b] == 0} {
        return {}
    }

    set res {}

    set a [lsort $a]
    set b [lsort $b]

    while {1} {
        # Store lindex/0,1 in var, access later faster ?
        set n [string compare [lindex $a 0] [lindex $b 0]]
        if {$n == 0} {
            # A = B => element in intersection
            lappend res [lindex $a 0]
            set a [lrange $a 1 end]
            set b [lrange $b 1 end]
        } elseif {$n > 0} {
            # A > B, remove B, we are beyond the element.
            set b [lrange $b 1 end]
        } else {
            # A < B, remove A, we are beyond the element.
            set a [lrange $a 1 end]
        }
        if {[llength $a] == 0} {
            return $res
        }
        if {[llength $b] == 0} {
            return $res
        }
    }

    return $res
}

Variant B.

proc ::setops::Intersect2 {a b} {
    if {[llength $a] == 0} {
        return {}
    }
    if {[llength $b] == 0} {
        return {}
    }

    set res {}

    foreach e $a {
        set aa($e) .
    }

    foreach e $b {
        set ba($e) .
    }

    foreach e $a {
        if {[info exists aa($e)] && [info exists ba($e)]} {
            lappend res $e
        }
    }

    foreach e $b {
        if {[info exists aa($e)] && [info exists ba($e)]} {
            lappend res $e
        }
    }

    return $res
}

MS: this algorithm returns a list where all elements appear twice.

Variant C.

proc ::setops::Intersect2 {a b} {
    if {[llength $a] == 0} {
        return {}
    }
    if {[llength $b] == 0} {
        return {}
    }

    set res {}

    if {[llength $a] < [llength $b]} {
        foreach e $b {
            set check($e) .
        }
        foreach e $a {
            if {[info exists check($e)]} {
                lappend res $e
            }
        }
    } else {
        foreach e $a {
            set check($e) .
        }
        foreach e $b {
            if {[info exists check($e)]} {
                lappend res $e
            }
        }
    }

    return $res
}

Variant D.

proc ::setops::Intersect2 {a b} {
    if {[llength $a] == 0} {
        return {}
    }
    if {[llength $b] == 0} {
        return {}
    }

    set res {}

    if {[llength $a] < [llength $b]} {
        foreach $b {.} {break}

        foreach e $a {
            if {[info exists $e]} {
                lappend res $e
            }
        }
    } else {
        foreach $a {.} {break}

        foreach e $b {
            if {[info exists $e]} {
                lappend res $e
            }
        }
    }

    return $res
}

MS: this algorithm will fail if $a contains an element 'b', as the first pass removes the list.

This algorithm pollutes the namespace of the proc with variable names taken from the list. It may unexpectedly fail when the list contains values that happen to be the variable names chosen above. --BvW

-- AK

I found this solution to be faster with optimisations in lsearch for -sorted

proc lintersect { a b } {
    set a [lsort $a]
    set result {}
    foreach element $b {
        if { [lsearch -sorted -increasing $a $element] != -1 } {
            lappend result $element
        }
    }
    return $result
}

--Bvw

Category Data Structure