Version 20 of lrmdups

Updated 2005-04-01 08:13:31 by antirez

Procedure to remove duplicate elements from a list. The returned list will be sorted.

RS: You don't even need a procedure for this, because that's exactly what lsort -unique does:

 interp alias {} lrmdups {} lsort -unique

Testing this interp alias interactively with a tclsh:

 % lrmdups {foo bar grill bar foo}
 bar foo grill

SS: Well... it depends. lsort -unique does it in O(N log N), while the trivial algorithm to remove duplicated elements runs in O(N). So if your list is very, very long, lsort may not be an option. Also, you may want a stable algorithm that will not alter the order of the non-duplicated elements.

RS: OK, but that means the sortedness specification is removed... Here is the trivial algorithm:

 proc lrmdups list {
    set res {}
    foreach element $list {
       if {[lsearch -exact $res $element]<0} {lappend res $element}
    }
    set res
 }
 % lrmdups {foo bar grill bar foo}
 foo bar grill

SS: Oops, sorry Richard, I didn't notice the "The returned list will be sorted" part. So what the author of this page wants really is lsort -unique. Still, your algorithm is not O(N), but O(N^2). In order to have an O(N) algorithm you need to use a dictionary with O(1) search complexity in the average case to check whether an element already exists or not, i.e. a Tcl array.

RS OK, can't resist a mini-challenge :)

 proc lrmdups list {
    array set a {}
    foreach element $list {set a($element) {}}
    lsort [array names a]
 }
 % lrmdups {foo bar grill bar foo}
 bar foo grill

SS: That's nice, and if you remove the final lsort it should indeed have O(N) complexity. Still, it is not stable. The following is a stable version:

 proc lrmdups list {
     array set x {}
     set res {}
     foreach e $list {
         if {[info exists x($e)]} continue
         lappend res $e
         set x($e) {}
     }
     return $res
 }
 % lrmdups {foo bar grill bar foo}
 foo bar grill

DKF: I suspect that that's actually still super-linear, but the reasoning for working that out is non-trivial. It depends on the distribution pattern of hash-chain lengths and the fact that you're building the entire hash table from scratch. The performance should be really good, though (and better with dict, which does fewer memory-management nasties).

SS: Setting aside that lappend is not O(1), and rehashing (so we have to assume we can size the hash table at the top of the procedure), I think that under perfect hashing, building the hash table is clearly O(N), and lookups are O(1). We are not in a perfect-hashing situation, but if the hash function and the keys (assuming, for example, words) are not strange enough to show strong patterns, and if the hash function behaves roughly the same way as N grows, then I think chains of a given length remain equally probable as N grows, so building the hash table will always take O(N) operations. Lookups should be O(1) under the same assumption, so I think this is an O(N) business. But my math is not good enough to prove this. If somebody very skilled with this kind of problem (like KBK, for example ;) could throw some light on this issue, that would be cool.
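For the curious, DKF's dict suggestion can be sketched as a variant of SS's stable version, using a dict as the set of already-seen elements instead of an array. This is a sketch assuming Tcl 8.5 or later (where dict is a core command), not a benchmarked implementation:

 proc lrmdups list {
     set seen [dict create]  ;# dict used purely as a set of seen elements
     set res {}
     foreach e $list {
         if {[dict exists $seen $e]} continue
         lappend res $e
         dict set seen $e {}
     }
     return $res
 }
 % lrmdups {foo bar grill bar foo}
 foo bar grill

Like the array version, it keeps the first occurrence of each element in its original position, so the result is stable.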

MG: You can use lsort to remove the duplicates even when you don't want the list sorted, with the -command option and something like this...

 proc -1 {a b} {
   return -1
 }

 lsort -unique -command -1 {foo bar grill bar foo}

NEM Well, the answer I get from trying that is:

 () 49 % proc -1 {a b} { return -1 }
 () 50 % lsort -unique -command -1 {foo bar grill bar foo}
 foo bar grill bar foo

in 8.5a3 (current cvs head).

MG, several hours later: Hrm, I could've sworn it was right when I tried it earlier from home. Running it now at work, I see the same thing you do. I suspect that just means I shouldn't post things in the middle of the night... :) Will delete this in a few days, to avoid confusing people, and to stop too many people finding out I'm an idiot ;)


Actually, according to the Chart of existing list functionality page, the functionality for this exists in a few extensions (the description above is from TclX), but not directly in the core. What I was looking for (opening this up before, asking, then removing my question, as I had actually resorted to an extension for only one feature :^( ) was the ability to remove duplicated entries without sorting (or shuffling) the list. I'm working with playlists, and I'd rather not have the same thing in there 3-4 times from importing other lists into one :^) This works great ... JMM