[NEM] 2008-06-10: Whilst implementing some sample [trie] code, I happened to use a naive implementation of a [queue], and was quite shocked at just how much of a performance penalty I had to pay: code that should have taken less than a second (even milliseconds) took almost a minute to complete! This inspired me to investigate various implementations of simple data structures such as [stacks and queues]. Here I present a number of different implementations of a basic [stack] data structure and compare them for performance and memory usage.

----

**Framework**

First, here's a little module framework to hide away some of the boilerplate. You can skip this bit: a module is just a [namespace ensemble] and a method is just a proc with some sugar for [upvar].

======
package require Tcl 8.5

# Simple modules
proc module {name body} {
    namespace eval ::$name $body
    namespace eval ::$name { namespace ensemble create }
}
proc cons {name args} {
    uplevel 1 [list proc $name $args { info level 0 }]
}

# A replacement for proc that exports the given command from the module.
# Also supports &foo syntax in parameters for automatically [upvar]ing
# variable references.
proc method {name params body} {
    set cmd "upvar 1"
    foreach p $params {
        set p [lindex $p 0]
        if {[string match &* $p]} {
            append cmd " \${$p} " [string range $p 1 end]
        }
    }
    set body "$cmd\n$body"
    uplevel 1 [list proc $name $params $body]
    uplevel 1 [list namespace export $name]
}
======

----

**Implementations**

***Naive Stack Implementation***

First, here is the most basic stack implementation, which is probably how most people would first try to implement a stack in Tcl. The stack is simply a list: pushing an element does an [lappend], and popping does an [lindex] to get the last element and then a [lrange] to truncate the stack list. As we will see, that [lrange] is an absolute killer for performance.
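The effect is easy to demonstrate with a quick, self-contained timing sketch (the `drain` helper is made up just for this illustration). Each [lrange] copies the whole remaining list, so popping every element is O(N²): doubling the stack size roughly quadruples the total time.

======
# Illustrative only: time popping every element off a plain-list stack
# using [lrange] truncation, for two stack sizes. Each [lrange] copies
# the remaining list, so total work grows quadratically with N.
proc drain n {
    set s {}
    for {set i 0} {$i < $n} {incr i} { lappend s $i }
    lindex [time {
        while {[llength $s] > 0} {
            set r [lindex $s end]
            set s [lrange $s 0 end-1]
        }
    }] 0
}
puts "10000 elements: [drain 10000] us"
puts "20000 elements: [drain 20000] us"
======

Exact figures will vary with the machine, but the second number should be noticeably more than twice the first.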
======
module SimpleStack {
    method init &s { set s [list] }
    method push {&s elem} { lappend s $elem }
    method pop &s {
        set r [lindex $s end]
        set s [lrange $s 0 end-1]
        return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}
======

''[Lars H], 2008-06-10: Well, there's little need to be quite so naive — a trivial [K]-combinator style rewrite of the above appears to restore linear complexity:''

======
module SimpleStack2 {
    method init &s { set s [list] }
    method push {&s elem} { lappend s $elem }
    method pop &s {
        set r [lindex $s end]
        set s [lreplace $s [set s end] end] ; # Make sure [lreplace] operates on an unshared object
        return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}
======

[NEM]: Not just restores it: it actually appears to be the fastest. I should have remembered [K], given that it came up in almost exactly the same kind of situation. Looking at the source of [Tcl_ListObjReplace], it seems that in the unshared case, where we are just deleting a single element from the end, this is guaranteed to be an O(1) operation, bringing the "pop all" operation back to O(N) overall.

[AK]: Right, I forgot about [K] as well. Using it, this actually becomes ''Maintain an index into a list'' internally, with Tcl tracking both the number of allocated list elements and the number of used list elements; the latter is our index. The only thing it doesn't do is actually shrink the allocated memory when dropping below some threshold. Regarding queues this doesn't help, as the system would have to copy everything down when removing something from the beginning of the list. There I normally use an index and double-buffering. I.e. new elements are appended to a second buffer while an index iterates over the first buffer when taking elements. When the first buffer is ''empty'' it is discarded (freeing memory), the second becomes the first, and a new empty buffer is created for appending. Fast, and keeps memory use down as well.
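The double-buffering scheme described above can be sketched as follows. This is an illustrative sketch, not Tcllib's implementation; the names `qinit`, `qput`, and `qget` are made up for the example. The queue value is a dict holding the two buffers and the read index.

======
# Sketch of the double-buffered queue idea (illustrative only).
# New elements are appended to the "in" buffer; elements are taken from
# the "out" buffer via an index. When "out" is exhausted it is discarded
# as a whole (freeing its memory), "in" is promoted to "out", and a
# fresh "in" is started, so nothing is ever copied down element by element.
proc qinit {} { dict create out {} idx 0 in {} }
proc qput {qvar elem} {
    upvar 1 $qvar q
    dict lappend q in $elem
}
proc qget qvar {
    upvar 1 $qvar q
    if {[dict get $q idx] >= [llength [dict get $q out]]} {
        # "out" exhausted: promote "in" (error if that is empty too)
        if {[llength [dict get $q in]] == 0} { error "queue empty" }
        dict set q out [dict get $q in]
        dict set q in {}
        dict set q idx 0
    }
    set idx [dict get $q idx]
    dict set q idx [expr {$idx + 1}]
    lindex [dict get $q out] $idx
}
======

For example, after `set q [qinit]; qput q a; qput q b`, successive `qget q` calls return `a` and then `b`, without ever shifting list elements.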
Back to [K]: I have to wonder, how many operations are there which are of the form ''set X [[operator ... $X ...]]''? How complex would it be to detect this situation in bytecode and rewrite it to contain a (user-invisible) [K]? This would be something for [Miguel Sofer] and/or [Kevin Kenny] to look at; IIRC one of the things he wishes to look into is ''peep-hole optimizations'', and this type of rewrite seems like it could fit.

[DKF]: I suspect that's very difficult. The problem is that the knowledge that K-like stuff is a safe optimization is encoded at an entirely different conceptual level from the one where the bytecode compiler works and where the peephole optimizations are performed.

[ferrieux] (20080619): As you may have noticed (because you proposed [lreplace] instead), [lrange] as of today does not optimize the unshared case, in that it still duplicates the element array. See [patch 1890831], "Lrange cleanup and in-place optimization", which does that. With this patch,

======
set x [lrange $x 0 end-1]
======

is just as fast as the [lreplace] one, but a bit easier on the eyes ;-)

Also, notice that for this kind of ''set X [[operator ... $X ...]]'' to take advantage of K-based derefs within a proc where X is an argument, an extra step is needed. Indeed, the stack frame structure holds a list for [info level 0], which holds extra refs to all formal arguments. Experimental [patch 1998273] provides an [info zaplevel] primitive which just nukes that [info level 0], so that the working idiom guaranteeing unshared values is:

======
proc f x {
    ...
    info zaplevel   ;# no info level 0 beyond this point
    lrange $x[unset x] 0 end-1
}
======

[Lars H]: If it's me you're addressing, then I partly have to disappoint you regarding the [lreplace] instead of [lrange] issue.
The reason I started using it is merely that in the old days of Tcl 7 [[sic!]], the end-1 notation wasn't available — the alternative to `[[lreplace $s end end]]` was `[[lrange $s 0 [[expr [[llength $s]]-2]]]]`, which was much worse on the eyes. Of course, `[[set s end]]` is a more recent addition that aims to make the object unshared, but I don't think looking at the implementation of [lreplace] came first.

***Maintain an index into a list***

A natural adaptation of the above is to avoid the [lrange] by instead maintaining an index pointer into the stack list. The stack is grown as needed on push, but we never shrink it. Instead we just adjust an index that points to the top of the stack.

======
module IndexStack {
    # Store index as first element, rest are real elements
    method init &s { set s [list 0] }
    method push {&s elem} {
        set idx [lindex $s 0]
        if {[incr idx] >= [llength $s]} {
            lappend s $elem
        } else {
            lset s $idx $elem
        }
        lset s 0 $idx
    }
    method pop &s {
        set idx [lindex $s 0]
        set head [lindex $s $idx]
        incr idx -1
        lset s 0 $idx
        return $head
    }
    method empty? &s { expr {[lindex $s 0] == 0} }
}
======

***Compacting Indexed Stack***

Here is a version that compacts the stack list whenever we are using less than half of it. This adds a bit of amortized overhead to each pop call, but can save quite a bit of space for stacks that vary in size greatly.

======
# A version of IndexStack that also compacts the stack if it grows small
module CompactingIndexStack {
    method init &s { IndexStack init s }
    method push {&s elem} { IndexStack push s $elem }
    method pop &s {
        set r [IndexStack pop s]
        # Check to see if worth compacting
        set len [llength $s]
        if {$len > 10 && [lindex $s 0] < ($len/2.0)} {
            # Yes...
            set s [lrange $s 0 [lindex $s 0]]
        }
        return $r
    }
    method empty? &s { IndexStack empty? s }
}
======

***Nested List (cons-cell) Stack***

Another approach would be to use a linked-list representation for the stack, as found in many [functional programming] languages.
Here we present two approaches: one using nested lists as cons cells, and one using arrays. First, the nested-list representation:

======
# An implementation using nested lists as cons cells
module ConsStack {
    cons Nil
    cons Pair a b
    method init &s { set s [Nil] }
    method push {&s elem} { set s [Pair $elem $s] }
    method pop &s {
        switch -exact [lindex $s 0] {
            Nil  { error "empty" }
            Pair {
                set r [lindex $s 1]
                set s [lindex $s 2]
                return $r
            }
        }
    }
    method empty? &s { expr {[llength $s] == 1} }
}
======

The implementation is almost as clear as the original simple version, but the performance is much better. You have to be careful, though, as Tcl isn't great at handling deeply nested structures. For instance, changing the implementation of empty? to the following quickly blows the C stack, causing a core dump (generating the string rep of a deeply nested list is a bad idea, it seems):

======
method empty? &s { expr {$s eq [Nil]} }
======

***Arrays as cons-cells***

The final version here is similar to the previous one, but here we use arrays as mutable cons cells. This is closer to a [Lisp] implementation, but is cumbersome to create in Tcl and needs some housekeeping code:

======
# An implementation using arrays as cons cells
module ArConsStack {
    variable id 0
    proc Nil {} { return "" }
    proc Pair {a b} {
        variable id
        set name ::ArConsStack::cell[incr id]
        upvar #0 $name cell
        set cell(h) $a
        set cell(t) $b
        return $name
    }
    proc car t {
        upvar #0 $t cell
        return $cell(h)
    }
    proc cdr t {
        upvar #0 $t cell
        return $cell(t)
    }
    method init &s { set s [Nil] }
    method push {&s elem} { set s [Pair $elem $s] }
    method pop &s {
        if {$s eq ""} { error "empty" }
        set r [car $s]
        set s2 [cdr $s]
        # Cleanup
        unset $s
        set s $s2
        return $r
    }
    method empty? &s { expr {$s eq ""} }
}
======

*** An array-based stack (using TclOO) ***

[pmarin] (2009-27-2): I have implemented an array-based stack and it seems faster than the tcllib implementation, although my test is too naive:

======
package provide stack 0.5
package require Tcl 8.6
package require TclOO

namespace import ::tcl::mathop::*
namespace import oo::*

class create stack {
    variable _stack _top
    constructor {} { set _top 0 }
    method empty {} { return [! $_top] }
    method clear {} {
        unset _stack
        set _top 0
    }
    method top {} {
        if {$_top > 0} {
            return $_stack($_top)
        } else {
            return -code error "the stack is empty"
        }
    }
    method push {elem} { set _stack([incr _top]) $elem }
    method pop {} {
        if {$_top > 0} {
            unset _stack($_top)
            incr _top -1
        } else {
            return -code error "the stack is empty"
        }
    }
    method size {} { return [array size _stack] }
    destructor {
        array unset _stack
        unset _top
    }
}
======

If you have any more clever implementations, feel free to add them here.

----

**Benchmarks**

We use the following simple test framework to both unit test and benchmark the implementations. It's pretty basic but gives an idea.

''— [Lars H]: I hacked '''test''' to allow varying the number of repetitions. For one thing, 100000 times took so long that the first time I ran it I thought it was stuck in an infinite loop.
Also, you need to vary the number of repetitions to determine the complexity exponent.''

''— [DKF]: See the procedure `cps` listed on the [TclOO] page, which tries to compute a number of runs that allows the main timing run to take about a second…''

======
proc benchmark {name body} {
    set time [lindex [uplevel 1 [list time $body]] 0]
    puts [format "%-10s : %10d us" $name $time]
}
proc assert {msg expr} {
    if {![uplevel 1 [list expr $expr]]} { error "ASSERT FAILED: $msg" }
}
proc test {stack {repetitions 100000}} {
    puts "\nTesting $stack $repetitions"
    # Basic unit/sanity checks
    $stack init s
    foreach x {a b c d e} { $stack push s $x }
    foreach x {e d c b a} {
        set y [$stack pop s]
        assert "$x == $y" {$y eq $x}
    }
    assert "stack empty" {[$stack empty? s]}
    unset s
    # Benchmark - create a big stack, then pop every element at once
    $stack init s
    benchmark push {
        for {set i 0} {$i < $repetitions} {incr i} { $stack push s $i }
    }
    benchmark pop {
        while {![$stack empty? s]} { $stack pop s }
    }
    puts "Final size: [string bytelength $s] bytes"
    # Now test with push/pops interleaved
    unset s
    benchmark push/pop {
        $stack init s
        for {set i 0} {10*$i < $repetitions} {incr i} {
            for {set j 0} {$j < 10} {incr j} { $stack push s $j }
            for {set j 9} {$j >= 0} {incr j -1} {
                set h [$stack pop s]
                assert "$h == $j" {$h == $j}
            }
        }
    }
    puts "Final size: [string bytelength $s] bytes"
}

test SimpleStack
test SimpleStack2
test IndexStack
test CompactingIndexStack
test ConsStack
test ArConsStack
======

Which gives the following results (2GHz Intel Core Duo MacBook):

======
Testing SimpleStack
push       :     200875 us
pop        :   70945579 us
Final size: 0 bytes
push/pop   :     910583 us
Final size: 0 bytes

Testing SimpleStack2
push       :     201996 us
pop        :     303392 us
Final size: 0 bytes
push/pop   :     887688 us
Final size: 0 bytes

Testing IndexStack
push       :     245049 us
pop        :     272153 us
Final size: 588891 bytes
push/pop   :     914710 us
Final size: 21 bytes

Testing CompactingIndexStack
push       :     428960 us
pop        :     694956 us
Final size: 13 bytes
push/pop   :    1369530 us
Final size: 11 bytes

Testing ConsStack
push       :     346980 us
pop        :     464432 us
Final size: 3 bytes
push/pop   :    1196018 us
Final size: 3 bytes

Testing ArConsStack
push       :     696860 us
pop        :     671755 us
Final size: 0 bytes
push/pop   :    1669332 us
Final size: 0 bytes
======

''[Lars H]: Note that the size numbers here really have very little to do with the actual size in memory of the corresponding data structure. For one, [string bytelength] only gives the size of the string representation, but that is typically not what we have. [Memory costs with Tcl] provides more detailed information. Another problem is that the amount of memory '''in use''' at the end can be very different from the amount of memory '''allocated'''. [Measuring memory usage] provides for measurements from the outside.''

Putting everything into a table (''[DKF]: I've reflected the table about a diagonal axis to make it easier to read.''):

%| Module | push (µs) | pop (µs) | Final size #1 (bytes) | push/pop (µs) | Final size #2 (bytes) |%
&| SimpleStack | 200875 | 70945579 | 0 | 910583 | 0 |&
&| SimpleStack2 | 201996 | 303392 | 0 | 887688 | 0 |&
&| IndexStack | 245049 | 272153 | 588891 | 914710 | 21 |&
&| CompactingIndexStack | 428960 | 694956 | 13 | 1369530 | 11 |&
&| ConsStack | 346980 | 464432 | 3 | 1196018 | 3 |&
&| ArConsStack | 696860 | 671755 | 0 | 1669332 | 0 |&

As we can see, the performance of SimpleStack is utterly abysmal, taking over 70 seconds to pop 100000 elements! However, for smallish stacks it is perfectly adequate and performs about as well as the rest of the implementations. The IndexStack is a clear winner in terms of speed, but wastes quite a bit of memory if the stack ever grows large. A separate compacting stage which is called every now and again could deal with this (e.g. compact if less than half the list is being used), for a small amortized cost in performance. That is the solution of CompactingIndexStack. The overhead seems quite high compared to the linked-list stack that follows it.
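As an aside, the [K]-based unsharing trick that makes SimpleStack2 competitive can be spelled out explicitly. This is a sketch for illustration: `K` is the classic combinator and `pop-last` is a made-up helper, not part of any module above.

======
# The K combinator: return the first argument, ignore the second.
proc K {x y} { return $x }

# Pop the last element of the list held in a variable. [set s {}]
# empties the variable as a side effect while K passes the old list
# value through, so by the time [lreplace] runs the list object is
# unshared and can be truncated in place rather than copied.
proc pop-last {varname} {
    upvar 1 $varname s
    set r [lindex $s end]
    set s [lreplace [K $s [set s {}]] end end]
    return $r
}
======

After `set s {a b c}`, `pop-last s` returns `c` and leaves `s` holding `a b`.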
The two cons-cell versions perform adequately. The ConsStack is a touch slower than the indexed version, while being able to shrink the stack efficiently as needed. The big problem, though, is the ease with which a heavily nested data structure can cause Tcl to crash! Be careful not to generate string reps you don't need! The ArConsStack performs adequately too, but suffers a bit of a performance overhead compared to lists.

[DKF]: The problem with `SimpleStack` is precisely its quadratic performance behaviour (which totally dominates anything relating to reference management).

[NEM]: Indeed. My initial code that prompted this investigation was using the idiom:

======
set xs [lassign $xs x]
======

as a simple in-place queue, which also suffers from the same quadratic behaviour. Incidentally, the [tcllib] versions of both [stack] and [queue] suffer from this O(N²) implementation (using [lreplace] instead of [lrange]).

[AK]: That will change.

[AK]: I should further note that the [tcllib] packages have some methods, like ''peek'', which are difficult to implement in some of the nicer representations. Incidentally, we only have benchmarks for sets and trees in the [struct] module. It would be good to put these up for stacks, and derive something for queues as well.

[AK] (2008-06-24): June 18/19 saw the creation of a benchmark for [Tcllib]'s [struct::stack], a speedup of the Tcl implementation, and the creation of a C implementation via [critcl].

----

!!!!!!
%| [Category Data Structure] | [Category Performance] |%
!!!!!!