Version 8 of Unique Element List

Updated 2010-07-19 22:52:08 by CJB

CJB: Returns the list without duplicate elements. This is different from [lsort -unique] in that it maintains the list ordering. It also keeps the first element encountered and removes subsequent callings.

proc uniqueList {list} {
  set new {}
  foreach item $list {
    if {[lsearch $new $item] < 0} {
      lappend new $item
    }
  }
  return $new
}

AMG: Here's another way which internally uses a hash table (amortized constant time per element, linear time overall) instead of a linear search (linear time per element, quadratic time overall):

proc uniqueList2 {list} {
  set dict {}
  foreach item $list {
    dict set dict $item ""
  }
  dict keys $dict
}

AMG: Performance timing:

set test [split "this is my test data" ""]
llength $test                                    ;#    20 elements
time {uniqueList  $test} 100000                  ;#    20.78 microseconds per iteration
time {uniqueList2 $test} 100000                  ;#    13.6  microseconds per iteration
set test [lrepeat 1000 {*}$test]; list
llength $test                                    ;# 20000 elements
time {uniqueList  $test} 1000                    ;# 17562.0 microseconds per iteration
time {uniqueList2 $test} 1000                    ;#  6844.0 microseconds per iteration

wdb Performance is reduced by usage of lsearch in procedure unique above. If you use the infix operator in resp. ni like here (needs Tcl 8.5) ...

proc uniqueList3 {list} {
  set new {}
  foreach item $list {
    if {$item ni $list} {
      lappend new $item
    }
  }
  return $new
}

... then the results show a different image:

% time {uniqueList  $test} 100000 
29.59143 microseconds per iteration
% time {uniqueList2  $test} 100000 
22.08167 microseconds per iteration
% time {uniqueList3  $test} 100000 
16.23675 microseconds per iteration

AMG: Yikes, there is a critical error in your code!! It always returns empty. Let me fix and retest:

proc uniqueList3 {list} {
  set new {}
  foreach item $list {
    if {$item ni $new} {
      lappend new $item
    }
  }
  return $new
}

time {uniqueList  $test} 100000   ;# 37.13617 microseconds per iteration
time {uniqueList2 $test} 100000   ;# 25.42803 microseconds per iteration
time {uniqueList3 $test} 100000   ;# 20.33975 microseconds per iteration

The gain over uniqueList2 is still there, but smaller. The corrected uniqueList3 runs in 80% the time of uniqueList2, whereas your original uniqueList3 ran in 75%.

Again, the results are different when the list length is much longer.

set test [lrepeat 1000 {*}[split "this is my test data" ""]]; list
time {uniqueList  $test} 1000       ;# 30554.174 microseconds per iteration
time {uniqueList2 $test} 1000       ;# 13517.655 microseconds per iteration
time {uniqueList3 $test} 1000       ;# 14456.274 microseconds per iteration

This is because uniqueList2 uses a different class of algorithm than uniqueList and uniqueList3. For shorter lists, uniqueList3 is fastest. For longer lists, uniqueList2 is fastest. Somewhere around 20,000 elements, in/ni becomes slower than dict, since in/ni uses a linear-time search whereas dict uses a constant-time search. Below 20,000 elements, the overhead of dict negates its performance benefit.

Of course, this is not a complete analysis. Also one must consider the performance for lists with lots of repetition versus lists with predominantly unique elements. But I don't think it's useful to go into this much detail, not on this page.


AMG: When (if) I write my [ring] command (i.e. set manipulation), the following will be all that's necessary:

proc uniqueListRing {list} {
    # Warning: vaporware! does not actually work!
    ring create {*}$list
}

It'll return a ring, which is my name for a set (since "set" is taken), and the string representation of a ring will be the same as that of a list of unique elements. Order will be preserved the same as CJB described above. The performance will be on the order of [uniqueList2], but it will be faster due to being implemented in C. Also the Tcl_Obj type of the returned value will be List, so no conversion will be necessary before doing list operations on it. This is because my plan is to augment the internal List struct with a pointer to an optional indexing structure that can be used to describe a dict or a ring.


CJB: Thanks, I have a short list so I'll use the ni version. I completely forgot about those operators. The dict version is neat, though I've never needed a list with 20000 items. Also, IMO I've never heard ring used as a synonym for a set so I don't like the name. I'd use series or collection.