AMG Please bear with my silly mood and my overlong narrative. We'll fix it later. :^) I think I'll move the "story" elements to a new page, leave the sort algorithms here, and move the array-based set stuff to a page on sets.
Argument
I needed a set data type with the following properties, above and beyond those of basic lists: no duplicate elements, no significance to element order, and a fast membership test.
I needed the following operations: creating an empty set, inserting an element, testing whether a given element is present, and listing all elements.
I didn't need to worry about removing elements, doing operations involving multiple sets (union, intersection, symmetric difference, subtraction, comparison, etc.), or generating power sets or anything like that.
Book 1 - Lists, or, the Author goes on at Length
So I thought the best way of doing it was with a list that always remained sorted. Why? Because this made the membership check very fast. And by the way, I used the variable name lst rather than set to avoid Vim erroneously highlighting it as a keyword. Also, lst holds the name of a variable in the caller's stack frame, not the set value itself.
proc scontains {lst elem} {
    upvar 1 $lst var
    return [expr {[lsearch -exact -sorted $var $elem] != -1}]
}
This gives a nice logarithmic-time means of checking whether or not an element is in a set. For example, determining whether or not a 1000-element set contains any given element takes at most 10 checks. For comparison's sake, using an unsorted list to represent a set can require up to 1000 checks for this same situation.
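For instance, a quick interactive sanity check (fruits is just a throwaway example variable):

% set fruits [lsort {cherry apple banana}]
apple banana cherry
% scontains fruits banana
1
% scontains fruits durian
0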
And I might as well cover "iteration" while I'm laying the groundwork. A set, being represented as a list, can be used directly as a list so long as the operation is read-only. But nevertheless I'll wrap it as a proc.
proc slist {lst} {
    upvar 1 $lst var
    return $var
}
This returns a list of every element present in the set.
Chapter 1 - In which Insertion is Developed
Next, insertion. But first I'll state that an empty set is represented as {}. I'll wrap it in a proc for the sake of completeness.
proc screate {lst} {
    upvar 1 $lst var
    set var {}
}
Nonempty sets can be built up from there using insertions. Here's the simplest possible implementation:
proc sinsert_simple {lst elem} {
    upvar 1 $lst var
    set var [lsort -unique [lappend var $elem]]
}
This works by first appending $elem to the end of the set list, then re-sorting the list while removing any duplicates. But this is rather wasteful because there is at most one duplicate (the newly-inserted $elem), yet this fact isn't communicated to [lsort] so it has no choice but to check the whole list for duplicates.
Before going any further, let's set up a benchmarking system to measure performance. I'll be running these tests on a 486 running at 50MHz. If your computer is slower than this, don't worry, Tcl will still run. :^)
proc test {func} {
    screate lst
    puts [time {$func lst [expr {int(rand() * 1000)}]} 1000]
}
This inserts 1000 random numbers into a set and prints the average insertion time. The numbers are themselves in the range 0 ... 999, so there will be a fair number of duplicates. (It might be an interesting probability problem to determine the distribution of the lengths of the sets after such a test. The expected length, at least, is easy: after k uniform draws from n values it's n*(1 - (1 - 1/n)^k), which for k = n = 1000 comes to about 632, i.e. roughly 37% of the insertions are duplicates.) Now let's try it with [sinsert_simple].
% test sinsert_simple
20445 microseconds per iteration
Yes, it really did take 20 seconds to run that test. Maybe it might go faster on computers of the future. Why, in another decade we might even be able to get it down to 10 seconds! :^)
Chapter 2 - Insertion gains in Sophistication
Next revision: Armed with the knowledge that the list representation need not change if $elem is already in the set, and that adding a new element never creates a duplicate, I can write:
proc sinsert_checkmemb {lst elem} {
    upvar 1 $lst var
    if {![scontains var $elem]} {
        set var [lsort [lappend var $elem]]
    }
    return $var
}
I'm guessing that in my test about 40% of the insertions are duplicates, so the time should go down to 12267 microseconds or so. Let's check.
% test sinsert_checkmemb
12098 microseconds per iteration
That's a pretty big improvement. Consistent with my 40% guess (okay, it wasn't a guess, I checked [llength $lst]), [sinsert_checkmemb] takes 40.8% less time to run than [sinsert_simple].
Chapter 3 - In which the Author milks Insertion for all it's Worth
But there are more possibilities in store. Since there's at most one element to insert into a list that's already sorted, why re-sort the entire list? This next proc ditches [lsort] in favor of a binary search to determine at which point in the list the new element needs to be inserted (or if it is already present).
proc sinsert_bsearch {lst elem} {
    upvar 1 $lst var
    set left 0
    set right [llength $var]
    while {$left < $right} {
        set idx [expr {($left + $right) / 2}]
        set dat [lindex $var $idx]
        switch -- [string compare $dat $elem] {
            -1 {set left [expr {$idx + 1}]}
             0 {return $var}
             1 {set right $idx}
        }
    }
    set var [linsert $var $left $elem]
    return $var
}
I was initially hesitant about coding so much in Tcl, since the C-implemented [lsort] "should" be faster due to being compiled machine code, but the benchmark shows that algorithmic improvements give better results than switching from bytecode to machine code.
% test sinsert_bsearch
7011 microseconds per iteration
See, rather than wait a decade (okay, okay, 18 months) for hardware technology to double my program's speed, I could just drop in a better algorithm and get the same effect in no time at all.
Chapter 4 - In which the Author admits his Lie
But I've been rather irresponsible in my benchmarks. Moving to a higher-performance class of algorithm (e.g. from quadratic to linear to logarithmic to constant time) always gives a speed boost for large n. (The n is the one in Big O notation: O(n log n), for example.) But it also tends to increase coding complexity and setup overhead. In fact, it's quite common for an unsophisticated algorithm to outperform a fancy one when the actual data has small values of n. So to even things out, let's first redo the benchmarks for various values of n rather than just the 1000 used above.
foreach n {1 5 10 25 50 100 250 500 1000} {
    puts "n = $n"
    foreach func {sinsert_simple sinsert_checkmemb sinsert_bsearch} {
        puts -nonewline " $func: "
        flush stdout
        set sum 0
        for {set i 0} {$i < 5} {incr i} {
            screate lst
            set dur [time {$func lst [expr {int(rand() * $n)}]} $n]
            incr sum [lindex $dur 0]
        }
        puts [expr {$sum / 5}]
    }
}
And here's the result (average microseconds per insertion), manually generated from the output of the above [foreach] loop:
   n   simple   checkmemb   bsearch
   1     1611        2442      1428
   5     1662        2282      2327
  10     1736        2330      2743
  25     1938        2578      3314
  50     2337        2653      3844
 100     3076        3070      4476
 250     5771        4439      5481
 500    10487        7234      6317
1000    20820       12650      7283
As you can see, [sinsert_bsearch] loses all the way up to around n=300, but from there on out its increased overhead pays for itself. [sinsert_simple] works fine until n=100, at which point [sinsert_checkmemb] starts to work better. The finished [sinsert] proc can determine which algorithm to use based on the list length, but first verify that the overhead of the [if] doesn't overwhelm your optimization.
I should also mention that the above table entries and the tradeoff points are all dependent on the nature of the data, specifically the ratio of unique insertions to duplicate insertions. So if you want to use any of this code in your own application, you should benchmark using actual data rather than blindly trusting the above.
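For concreteness, here's one possible shape for such a finished [sinsert]. It's only a sketch: the thresholds (100 and 300) are read straight off my table above, so treat them as placeholders to be re-measured against your own data.

proc sinsert {lst elem} {
    upvar 1 $lst var
    # Crossover points below come from my benchmark; re-measure for your data.
    if {[llength $var] < 100} {
        sinsert_simple var $elem
    } elseif {[llength $var] < 300} {
        sinsert_checkmemb var $elem
    } else {
        sinsert_bsearch var $elem
    }
}

Each helper already declares upvar 1, so passing along the local name var keeps the chain pointing at the caller's variable.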
Important rule of thumb: n is always small. Look before you optimize! I didn't. Which brings me to the next book in my tale of coding and woe.
Book 2 - Arrays, or, in which the Author realizes he is an Idiot
You may have noticed the use of past tense in the above narrative. I have since repented, deciding that a sorted list is not, in fact, the best way to represent a set. Per Joe English's casual suggestion (I hope I have that attribution correct), I tried using arrays. The results were both astonishing and humbling. Mostly humbling!
Chapter 1 - In which the Paradigm is Shifted
So I switched from using lists to arrays to represent my sets. For the sake of this Narrative we don't care about the value associated with any particular key in the array, so we'll just use "". But know that in your own program you're free to abuse arrays any way you wish.
Any given element can either be or not be in any given set. Any given key can either be or not be in any given array. See the parallel? The properties of sets map much more closely to arrays than they do to lists, wherein we have to do special things to avoid duplicate elements and discourage dependence on order.
First we need a way of creating a new, empty set. The following will do:
proc screate {arr} {
    upvar 1 $arr var
    unset -nocomplain var   ;# discard any existing elements first
    array set var {}
}
And here's the array version of [scontains]. I'm changing from lst to arr as my set variable name. Again, I'd just say set directly if not for Vim.
proc scontains {arr elem} {
    upvar 1 $arr var
    return [info exists var($elem)]
}
Notice how easy that was. All the searching has moved from my script to inside the Tcl core, so this code will automatically take advantage of all future algorithmic breakthroughs in the field of fast array lookup and subsequent improvements to the Tcl array implementation.
It's a little more tricky to get a list of all elements in a set, now that the set is no longer itself a list. But it's definitely doable:
proc slist {arr} {
    upvar 1 $arr var
    return [array names var]
}
The returned list is by no means sorted, but sort order's not supposed to matter anyway. If you need it sorted, just apply [lsort] afterwards. Of course this particular proc is less efficient than its list-based counterpart from Book 1, but it's still workable.
Chapter 2 - In which the Difficult becomes Easy (and fast, too!)
Now for set insertion. If you will recall, this occupied the bulk of the discussion above. That's actually a good thing, since this page is supposed to be about sorted lists, not arrays, so I'll leave all that claptrap where it is just in case somebody decides he wants a sorted list.
proc sinsert {arr elem} {
    upvar 1 $arr var
    set var($elem) ""
}
Don't blink or you'll miss it! This code is now very short and simple. And as it turns out, very fast. Here's proof:
foreach n {1 5 10 25 50 100 250 500 1000} {
    set sum 0
    for {set i 0} {$i < 5} {incr i} {
        screate arr
        set dur [time {sinsert arr [expr {int(rand() * $n)}]} $n]
        incr sum [lindex $dur 0]
    }
    puts "n = $n : [expr {$sum / 5}]"
}
Adding the array column to the previous table yields absolutely marvelous numbers:
   n   simple   checkmemb   bsearch   array
   1     1611        2442      1428    1561
   5     1662        2282      2327    1158
  10     1736        2330      2743    1158
  25     1938        2578      3314    1162
  50     2337        2653      3844    1160
 100     3076        3070      4476    1196
 250     5771        4439      5481    1164
 500    10487        7234      6317    1163
1000    20820       12650      7283    1168
We appear to have gone to constant time. Stunning. (In hindsight it makes sense: Tcl arrays are hash tables, so insertion and lookup take expected constant time no matter how many elements are already present.)
I'm sure there's a hidden cost. Most likely the array representation of sets uses more memory than the list representation. I don't have the tools at hand to check memory usage, though, so I'll leave that for someone else. Honestly, I don't really even care: if memory usage were crucial, I'd just use a slower algorithm.
Fin.
AMG I just discovered a shortcoming of the array-based set procedures: because arrays are not first-class values in Tcl, it is impossible to store these sets inside other arrays. This is fatal for the program I'm writing now. But that's not so bad, since said program has n of up to, uhm, 50, and at that point performance is quite acceptable using the basic [lappend]/[lsort -unique] method.
AM Beware of code outside procedures: such code is not compiled and hence messes up the timings. I suspect that with the various versions inside a procedure the timing results will be very different!
AMG Okay, I redid the benchmarks using a proc. (Whereas above I did the table formatting manually this time I got smart and used MAK's [listToTable], found in Additional list functions. I modified it a bit to do right justification.)
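The redone driver looked roughly like this (a reconstruction for illustration, not the exact code I used; the array variant is analogous). The important part, per AM's tip, is that the timed loop now lives inside a proc body where it gets byte-compiled:

proc bench {funcs} {
    foreach n {1 5 10 25 50 100 250 500 1000} {
        set row $n
        foreach func $funcs {
            set sum 0
            for {set i 0} {$i < 5} {incr i} {
                screate lst
                set dur [time {$func lst [expr {int(rand() * $n)}]} $n]
                incr sum [lindex $dur 0]
            }
            lappend row [expr {$sum / 5}]
        }
        puts $row
    }
}
bench {sinsert_simple sinsert_checkmemb sinsert_bsearch}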
   n   simple   checkmemb   bsearch   array
   1     1467        4322      1541    1890
   5     2136        2202      2678    1345
  10     2095        2679      3306    1298
  25     2526        2275      4172    1272
  50     3014        2041      4116    1047
 100     4762        2196      4856    1122
 250    10774        2931      5793    1051
 500    21721        4263      6348    1063
1000    44843        7075      7122    1056
Yeah, that is different. Thanks for the tip. I can see more clearly now that using Tcl's built-in binary search (inside the implementation of [lsearch]) wins over my script-based binary search. But I think that if unique (non-duplicate) insertions were more frequent, [sinsert_bsearch] would outperform [sinsert_checkmemb], since the latter pays for a full [lsort] on every non-duplicate insertion. Again, relative performance of different algorithms is often very data-dependent.
If the [lsearch -sorted] command could be made to return the index where an element should go, even when it isn't actually present in the list, that would give the best performance of all.
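As it happens, Tcl 8.6 later grew exactly this: [lsearch -bisect] returns the index of the last element less than or equal to the pattern (or -1 if the pattern sorts before everything). A sketch of an insertion proc built on it, assuming Tcl 8.6 or later:

proc sinsert_bisect {lst elem} {
    upvar 1 $lst var
    # Index of the last element <= $elem, or -1 if $elem sorts first.
    set idx [lsearch -bisect $var $elem]
    if {$idx == -1 || [lindex $var $idx] ne $elem} {
        set var [linsert $var [expr {$idx + 1}] $elem]
    }
    return $var
}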
schlenk: In tcllib, prioqueue and skiplist provide some 'sorted list' kind of functionality.
NEM: With regard to the drawback of the array-based implementation, have you tried using dicts? They're in 8.5 and there is a package available which provides them for 8.4 too.
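For what it's worth, here's a minimal sketch of that idea: the Book 2 interface backed by a dict instead of an array (assumes Tcl 8.5, or the dict package on 8.4). Since a dict is an ordinary first-class value, these sets can be stored inside arrays, lists, or other dicts without trouble:

proc screate {d} {
    upvar 1 $d var
    set var [dict create]
}
proc sinsert {d elem} {
    upvar 1 $d var
    dict set var $elem ""
}
proc scontains {d elem} {
    upvar 1 $d var
    return [dict exists $var $elem]
}
proc slist {d} {
    upvar 1 $d var
    return [dict keys $var]
}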
AMG: If all set access is performed through procedures ([screate], [sinsert], [slist], etc.), it's quite possible for the set implementation to store the set data in an internal namespace. For example:
namespace eval ::set {
    variable next_id 0
}

proc ::set::create {} {
    set id $::set::next_id
    array set ::set::set_$id {}
    incr ::set::next_id
    return $id
}

proc ::set::include {id args} {
    foreach elem $args {
        set ::set::set_$id\($elem) ""
    }
}

proc ::set::contents {id} {
    return [array names ::set::set_$id]
}

% set quux [set::create]
% set::include $quux alpha bravo charlie delta
% set::contents $quux
charlie delta bravo alpha
This allows client code to treat sets as ordinary (if opaque) strings yet still takes advantage of Tcl's array system.
AK: Regarding tcllib, it has set operations under struct, called [struct::set].
AMG: I see [struct::set] makes no attempt at sorting or any other such "optimization" of dubious benefit. This is probably for the best: since it has no clue about the nature of the data it's given, it cannot hope to select the proper algorithm, so it falls back on the simple approach.
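A quick taste, going from the tcllib documentation as I remember it (double-check the man page before relying on the details):

package require struct::set

set s {}
::struct::set include s apple    ;# [include] takes a variable name
::struct::set include s banana
::struct::set include s apple    ;# duplicate, so the set is unchanged
puts [::struct::set contains $s apple]   ;# prints 1
puts [::struct::set size $s]             ;# prints 2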
I'm half-tempted to also put this in Category Book. :^)