Unique Element List

CJB: Returns the list without duplicate elements. This is different from [lsort -unique] in that it maintains the list ordering. It also keeps the first element encountered and removes subsequent callings.

proc uniqueList {list} {
  set new {}
  foreach item $list {
    if {[lsearch $new $item] < 0} {
      lappend new $item
    }
  }
  return $new
}

AMG: Here's another way which internally uses a hash table (amortized constant time per element, linear time overall) instead of a linear search (linear time per element, quadratic time overall):

proc uniqueList2 {list} {
  set dict {}
  foreach item $list {
    dict set dict $item ""
  }
  dict keys $dict
}

AMG: Performance timing:

set test [split "this is my test data" ""]
llength $test                                    ;#    20 elements
time {uniqueList  $test} 100000                  ;#    20.78 microseconds per iteration
time {uniqueList2 $test} 100000                  ;#    13.6  microseconds per iteration
set test [lrepeat 1000 {*}$test]; list
llength $test                                    ;# 20000 elements
time {uniqueList  $test} 1000                    ;# 17562.0 microseconds per iteration
time {uniqueList2 $test} 1000                    ;#  6844.0 microseconds per iteration

wdb Performance is reduced by usage of lsearch in procedure unique above. If you use the infix operator in resp. ni like here (needs Tcl 8.5) ...

proc uniqueList3 {list} {
  set new {}
  foreach item $list {
    if {$item ni $new} {
      lappend new $item
    }
  }
  return $new
}

time {uniqueList  $test} 100000   ;# 37.13617 microseconds per iteration
time {uniqueList2 $test} 100000   ;# 25.42803 microseconds per iteration
time {uniqueList3 $test} 100000   ;# 20.33975 microseconds per iteration

The gain over uniqueList2 is still there, but smaller. The corrected uniqueList3 runs in 80% the time of uniqueList2, whereas your original uniqueList3 ran in 75%.

Again, the results are different when the list length is much longer.

set test [lrepeat 1000 {*}[split "this is my test data" ""]]; list
time {uniqueList  $test} 1000       ;# 30554.174 microseconds per iteration
time {uniqueList2 $test} 1000       ;# 13517.655 microseconds per iteration
time {uniqueList3 $test} 1000       ;# 14456.274 microseconds per iteration

This is because uniqueList2 uses a different class of algorithm than uniqueList and uniqueList3. For shorter lists, uniqueList3 is fastest. For longer lists, uniqueList2 is fastest. Somewhere around 20,000 elements, in/ni becomes slower than dict, since in/ni uses a linear-time search whereas dict uses a constant-time search. Below 20,000 elements, the overhead of dict negates its performance benefit.

Of course, this is not a complete analysis. Also one must consider the performance for lists with lots of repetition versus lists with predominantly unique elements. But I don't think it's useful to go into this much detail, not on this page.


AMG: When (if) I write my [ring] command (i.e. set manipulation), the following will be all that's necessary:

proc uniqueListRing {list} {
    # Warning: vaporware! does not actually work!
    ring create {*}$list
}

It'll return a ring, which is my name for a set (since "set" is taken), and the string representation of a ring will be the same as that of a list of unique elements. Order will be preserved the same as CJB described above. The performance will be on the order of [uniqueList2], but it will be faster due to being implemented in C. Also the Tcl_Obj type of the returned value will be List, so no conversion will be necessary before doing list operations on it. This is because my plan is to augment the internal List struct with a pointer to an optional indexing structure that can be used to describe a dict or a ring.


CJB: Thanks, I have a short list so I'll use the ni version. I completely forgot about those operators. The dict version is neat, though I've never needed a list with 20000 items. Also, IMO I've never heard ring used as a synonym for a set so I don't like the name. I'd use series or collection.

AMG: A series is not a set, it's a value equal to the sum of the terms of a sequence. Collection could work, since it's a generic term for a set; but it also can refer to a multiset (unordered list, repetition allowed), or to a class or family of sets. Also I'd prefer a short name; four letters long is good, to match dict.

See [L1 ] for more on rings. A ring is a set along with definitions for "addition" and "multiplication" binary operators on the elements of the set. What I propose isn't a ring in its own right, since it doesn't include the operator definitions. But the operators can obviously be defined elsewhere in the script, so it can be used to create true rings. Finite rings, anyway. Remember, I'm only using the name "ring" because it's closely related to "set" and "set" is firmly taken.

I wonder. How can addition and multiplication possibly be defined for arbitrary strings? Addition can be concatenation, but multiplication? Multiplying a string by a number would be [string repeat], but what about multiplying a string by a string? That I don't get. Oh well, it doesn't matter anyway. The language doesn't have to define these operators; the script will do that, if it needs them.

FM: In french, we call it ensemble. A group of french mathematicians worked a lot on this subject [L2 ]. As this name allready exists in Tcl (namespace ensemble create create a set of commands), ensemble could be a better choice. This way, ring could be used only for true rings.

package require struct::set
namespace eval ensemble {
    namespace ensemble create -subcommands {
        empty size contains union intersect difference symdiff intersect3 equal include exclude add substract subsetof
    }
}
interp alias {} ensemble::empty      {} struct::set empty
interp alias {} ensemble::size       {} struct::set size
interp alias {} ensemble::contains   {} struct::set contains
interp alias {} ensemble::union      {} struct::set union
interp alias {} ensemble::intersect  {} struct::set intersect
interp alias {} ensemble::difference {} struct::set difference
interp alias {} ensemble::symdiff    {} struct::set symdiff
interp alias {} ensemble::intersect3 {} struct::set intersect3
interp alias {} ensemble::equal      {} struct::set equal
interp alias {} ensemble::include    {} struct::set include
interp alias {} ensemble::exclude    {} struct::set exclude
interp alias {} ensemble::add        {} struct::set add
interp alias {} ensemble::substract  {} struct::set substract
interp alias {} ensemble::subsetof   {} struct::set subsetof

console show

lassign {{} {}} E0 E1
foreach e {a b c d e f g h i j k l m n o p q r} {
    ensemble include E0 $e
}

foreach e {i j k l m n o p q r s t u v w x y z} {
    ensemble include E1 $e
}
set E3 []
set E4 []
puts "interface of an ensemble object"
puts E0\ :\ $E0
puts E1\ :\ $E1
puts E3\ :\ $E3
puts E4\ :\ $E4
puts empty\ E0\ ?\ [ensemble empty $E0]
puts empty\ E1\ ?\ [ensemble empty $E1]
puts empty\ E3\ ?\ [ensemble empty $E3]
puts empty\ E4\ ?\ [ensemble empty $E4]
puts size\ of\ E0\ :\ [ensemble size $E0]
puts size\ of\ E1\ :\ [ensemble size $E1]
puts size\ of\ E3\ :\ [ensemble size $E3]
puts size\ of\ E4\ :\ [ensemble size $E4]
puts union\ of\ E0\ and\ E1\ (ensemble\ E3)\ :\ [set E3 [lsort [ensemble union $E0 $E1]]]
puts intersect\ of\ E0\ and\ E1\ :\ [lsort [ensemble intersect $E0 $E1]]
puts intersect3\ of\ E0\ and\ E1\ :\ [lsort [ensemble intersect3 $E0 $E1]]
puts difference\ of\ E0\ and\ E1\ :\ [lsort [ensemble difference $E0 $E1]]
puts symdiff\ of\ E0\ and\ E1\ :\ [lsort [ensemble symdiff $E0 $E1]]
puts add\ E1\ to\ E4\ and\ E0\ to\ E4[ensemble add E4 $E0][ensemble add E4 $E1].
puts E0\ subsetof\ E3\ ?\ [ensemble subsetof $E0 $E3]
puts E4\ equal\ E3\ ?\ [ensemble equal $E3 $E4]
puts E3\ contains\ 'a'\ ?\ [ensemble contains $E3 a]
puts exclude\ 'a'\ from\ E3[ensemble exclude E3 a].
puts E3\ contains\ 'a'\ ?\ [ensemble contains $E3 a]
puts E3\ equal\ E4\ ?\ [ensemble equal $E3 $E4]
puts E3\ subsetof\ E4\ ?\ [ensemble subsetof $E3 $E4]