Performance of Various Stack Implementations

NEM 2008-06-10: Whilst implementing some sample trie code, I happened to use a naive implementation of a queue, and was quite shocked at just how much of a performance penalty I had to pay: code that should have taken less than a second (even milliseconds) took almost a minute to complete! This inspired me to do some investigation of various implementations for simple data structures such as stacks and queues. Here I present a number of different implementations of a basic stack data structure and compare them for performance and memory usage.


Framework

First, here's a little module framework to hide away some of the boilerplate. You can skip this bit: a module is just a namespace ensemble and a method is just a proc with some sugar for upvar.

package require Tcl 8.5
# Simple modules
proc module {name body} {
    namespace eval ::$name $body
    namespace eval ::$name { namespace ensemble create }
}
proc cons {name args} {
    uplevel 1 [list proc $name $args { info level 0 }]
}
# A replacement for proc that exports the given command from the module.
# Also supports &foo syntax in parameters for automatically [upvar]ing
# variable references.
proc method {name params body} {
    set cmd "upvar 1"
    foreach p $params {
        set p [lindex $p 0]
        if {[string match &* $p]} {
            append cmd " \${$p} " [string range $p 1 end]
        }
    }
    set body "$cmd\n$body"
    uplevel 1 [list proc $name $params $body]
    uplevel 1 [list namespace export $name]
}

Implementations

Naive Stack Implementation

First, here is the most basic stack implementation that is probably how most people would first try and implement a stack in Tcl. The stack is simply a list: pushing an element does an lappend and popping does an lindex to get the last element and then a lrange to truncate the stack list. As we will see, that lrange is an absolute killer for performance.

module SimpleStack {
    method init &s { set s [list] }
    method push {&s elem} {
        lappend s $elem
    }
    method pop &s {
        set r [lindex $s end]
        set s [lrange $s 0 end-1]
        return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}

Lars H, 2008-06-10: Well, there's little need to be quite so naive — a trivial K-combinator style rewrite of the above appears to restore linear complexity:

module SimpleStack2 {
    method init &s { set s [list] }
    method push {&s elem} {
        lappend s $elem
    }
    method pop &s {
        set r [lindex $s end]
        set s [lreplace $s [set s end] end] ; # Make sure [lreplace] operates on unshared object
        return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}

NEM: Not just restores, but actually appears to be the fastest. I should have remembered K, given that it came up in almost exactly the same kind of situation. Looking at the source of Tcl_ListObjReplace it seems in the unshared case where we are just deleting a single element from the end then this should be guaranteed to be an O(1) operation, bringing the "pop all" operation back indeed to O(N).

AK: Right, I forgot about K as well. Using it this actually becomes Maintain an index into a list internally, with Tcl tracking both #allocated list elements vs #used list elements, and the latter is our index. The only thing it doesn't do is to actually shrink the allocated memory when dropping below some threshold.

Regarding queues this doesn't help, as the system should have to copy everything down when removing something from the beginning of the list. There I normally use an index and double-buffering. I.e. new elements are appended into a second buffer while an index iterates over the first buffer when taking elements. When the first buffer is empty it is discarded (free memory), the second becomes first, and a new empty buffer is generated for appending. Fast and keeps memory use down as well.

Back to K, I have to wonder, how many operations are there which are of the form set X [operator ... $X ...] ? How complex would it be to detect this situation in bytecode and rewrite it to contain a (user-invisible) K ? This would be for something for Miguel Sofer and/or Kevin Kenny to look at, IIRC one of the things he wishes to look into are peep-hole optimizations and this type of rewrite seems that it could fit.

DKF: I suspect that's very difficult. The problem is that the knowledge that K-like stuff is a safe optimization is encoded at an entirely different conceptual level to the one where both the bytecode compiler works and where the peephole optimizations are performed.

ferrieux (20080619): As you may have noticed because you proposed lreplace instead, lrange as of today does not optimize the unshared case, in that it still duplicates the element array. See patch 1890831 "Lrange cleanup and in-place optimization" which does that. With this patch,

    set x [lrange $x 0 end-1]

is just as fast as the lreplace one, but a bit easier on the eyes ;-) Also, notice that for this kind of set X [operator ... $X ...] to take advantage of K-based derefs with a proc where X is an argument, an extra step is needed. Indeed, the stack frame structure holds a list for info level 0, which holds extra refs to all formal arguments. Experimental patch 1998273 provides an info zaplevel primitive who just nukes that info level 0, so that the working idiom guaranteeing unshared values is:

    proc f x {
       ...
       info zaplevel ;# no info level 0 beyond this point
       lrange $x[unset x] 0 end-1
    }

Lars H: If it's me you're addressing, then I partly have to disappoint you regarding the lreplace instead of lrange issue. The reason I started using it is merely that in the old days of Tcl 7 [sic!], the end-1 notation wasn't available — the alternative to [lreplace $s end end] was [lrange $s 0 [expr [llength $s]-2]]], which was much worse on the eyes. Of course, [set s end] is a more recent addition that aims to make the object unshared, but I don't think looking at the implementation of lreplace came first.

Maintain an index into a list

A natural adaptation of the above is to avoid the lrange by instead maintaining an index pointer into the stack list. The stack is grown as needed on push, but we never shrink it. Instead we just adjust an index that points to the top of the stack.

module IndexStack {
    # Store index as first element, rest are real elements
    method init &s { set s [list 0] }
    method push {&s elem} {
        set idx [lindex $s 0]
        if {[incr idx] >= [llength $s]} {
            lappend s $elem
        } else {
            lset s $idx $elem
        }
        lset s 0 $idx
    }
    method pop &s {
        set idx [lindex $s 0]
        set head [lindex $s $idx]
        incr idx -1
        lset s 0 $idx
        return $head
    }
    method empty? &s {
        expr {[lindex $s 0] == 0}
    }
}

Compacting Indexed Stack

Here is a version that compacts the stack list whenever we are using less than half of it. This adds a bit of amortized overhead to each pop call, but can save quite a bit of space for stacks that vary in size greatly.

# A version of IndexStack that also compacts the stack if it grows small
module CompactingIndexStack {
    method init &s { IndexStack init s }
    method push {&s elem} { IndexStack push s $elem }
    method pop &s {
        set r [IndexStack pop s]
        # Check to see if worth compacting
        set len [llength $s]
        if {$len > 10 && [lindex $s 0] < ($len/2.0)} {
            # Yes...
            set s [lrange $s 0 [lindex $s 0]]
        }
        return $r
    }
    method empty? &s { IndexStack empty? s }
}

Nested List (cons-cell) Stack

Another approach would be to use a linked list representation for the stack, as found in many functional programming languages. Here we present two approaches: one using nested lists as cons cells, and one using arrays. First, the nested-list representation:

# An implementation using nested lists as cons cells
module ConsStack {
    cons Nil
    cons Pair a b
        
    method init &s { set s [Nil] }
    method push {&s elem} {
        set s [Pair $elem $s]
    }
    method pop &s {
        switch -exact [lindex $s 0] {
            Nil     { error "empty" }
            Pair    { 
                set r [lindex $s 1]
                set s [lindex $s 2]
                return $r
            }
        }
    }
    method empty? &s { expr {[llength $s] == 1} }
}

The implementation is almost as clear as the original simple version, but the performance is much better. You have to be careful though, as Tcl isn't great at handling deeply nested structures. For instance, changing the implementation of empty? to the following quickly blows the C stack causing a core-dump (generating the string rep of a deeply nested list is a bad idea, it seems).

method empty? &s { expr {$s eq [Nil]} }

Arrays as cons-cells

The final version here is similar to the previous one, but here we use arrays as mutable cons cells. This is closer to a Lisp implementation, but is cumbersome to create in Tcl and needs some housekeeping code:

# An implementation using arrays as cons cells
module ArConsStack {
    variable id 0
    proc Nil {} { return "" }
    proc Pair {a b} {
        variable id
        set name ::ArConsStack::cell[incr id]
        upvar #0 $name cell
        set cell(h) $a
        set cell(t) $b
        return $name
    }
    proc car t {
        upvar #0 $t cell
        return $cell(h)
    }
    proc cdr t {
        upvar #0 $t cell
        return $cell(t)
    }
    method init &s { set s [Nil] }
    method push {&s elem} {
        set s [Pair $elem $s]
    }
    method pop &s {
        if {$s eq ""} { error "empty" }
        set r [car $s]
        set s2 [cdr $s]
        # Cleanup
        unset $s
        set s $s2
        return $r
    }
    method empty? &s {
        expr {$s eq ""}
    }
}

An array-based stack (using TclOO)

pmarin (2009-27-2): I have implemented an array-based stack an seems faster than the tcllib implementation, although my test is too naive:

package provide stack 0.5

package require Tcl 8.6 
package require TclOO
namespace import ::tcl::mathop::*
namespace import oo::*


class create stack {
   variable _stack _top 

   constructor {} {
      set _top  0
   }

   method empty {} {
      return [! $_top]
   }

   method clear {} {
      unset _stack
      set _top 0
   }
  
   method top {} {
      if { $_top > 0} {
         return  $_stack($_top)
      } else {
         return -code error "the stack is empty"
      }

   }

   method push { elem } {
      set _stack([incr _top]) $elem
   }

   method pop {} {
      if {$_top > 0} {
         unset _stack($_top)
         incr _top -1
      } else {
         return -code error "the stack is empty"
      }
   }

   method size {} {
      return [array size _stack]
   }

   destructor  {
      array unset _stack
      unset _top
   }
}

If you have any more clever implementations, feel free to add them here.


Benchmarks

We use the following simple test framework to both unit test and benchmark the implementations. It's pretty basic but gives an idea. Lars H: I hacked test to allow varying the number of repetitions. For one thing, 100000 times took so long that the first time I ran it I thought it was stuck in an infinite loop. Also, you need to vary the number of repetitions to determine the complexity exponent. DKF: See the procedure cps listed on the TclOO page, which tries to compute a number of runs that allows the main timing run to take about a second…

proc benchmark {name body} {
    set time [lindex [uplevel 1 [list time $body]] 0]
    puts [format "%-10s : %10d us" $name $time]
}
proc assert {msg expr} {
    if {![uplevel 1 [list expr $expr]]} {
        error "ASSERT FAILED: $msg"
    }
}
proc test {stack {repetitions 100000}} {
    puts "\nTesting $stack $repetitions"
    # Basic unit/sanity checks
    $stack init s
    foreach x {a b c d e} { $stack push s $x }
    foreach x {e d c b a} {
        set y [$stack pop s]
        assert "$x == $y" {$y eq $x}
    }
    assert "stack empty" {[$stack empty? s]}
    unset s
    
    # Benchmark - create a big stack then pop every element at once
    $stack init s
    benchmark push { for {set i 0} {$i < $repetitions} {incr i} { $stack push s $i } }
    benchmark pop  { while {![$stack empty? s]} { $stack pop s } }
    puts "Final size: [string bytelength $s] bytes"
    
    # Now test with push/pops interleaved
    unset s
    benchmark push/pop {
        $stack init s
        for {set i 0} {10*$i < $repetitions} {incr i} {
            for {set j 0} {$j < 10} {incr j} { $stack push s $j }
            for {set j 9} {$j >= 0} {incr j -1} {
                set h [$stack pop s]
                assert "$h == $j" {$h == $j}
            }
        }
    }
    puts "Final size: [string bytelength $s] bytes"
}
test SimpleStack
test IndexStack
test ConsStack
test ArConsStack

Which gives the following results (2GHz Intel Core Duo MacBook):

Testing SimpleStack
push       :     200875 us
pop        :   70945579 us
Final size: 0 bytes
push/pop   :     910583 us
Final size: 0 bytes

Testing SimpleStack2
push       :     201996 us
pop        :     303392 us
Final size: 0 bytes
push/pop   :     887688 us
Final size: 0 bytes

Testing IndexStack
push       :     245049 us
pop        :     272153 us
Final size: 588891 bytes
push/pop   :     914710 us
Final size: 21 bytes

Testing CompactingIndexStack
push       :     428960 us
pop        :     694956 us
Final size: 13 bytes
push/pop   :    1369530 us
Final size: 11 bytes

Testing ConsStack
push       :     346980 us
pop        :     464432 us
Final size: 3 bytes
push/pop   :    1196018 us
Final size: 3 bytes

Testing ArConsStack
push       :     696860 us
pop        :     671755 us
Final size: 0 bytes
push/pop   :    1669332 us
Final size: 0 bytes

Lars H: Note that the size numbers here really have very little to do with the actual size in memory of the corresponding data structure. For one, string bytelength only gives the size of the string representation, but that is typically not what we have. Memory costs with Tcl provides more detailed information. Another problem is that the amount of memory in use at the end can be very different from the amount of memory allocated. Measuring memory usage provides for measurements from the outside.

Putting everything into a table (DKF: I've reflected the table about a diagonal axis to make it easier to read.)

Module push (µs) pop (µs) Final size #1 (bytes) push/pop (µs) Final size #2 (bytes)
SimpleStack 200875 70945579 0 910583 0
SimpleStack2 201996 303392 0 887688 0
IndexStack 245049 272153 588891 914710 21
CompactingIndexStack 428960 694956 13 1369530 11
ConsStack 346980 464432 3 1196018 3
ArConsStack 696860 671755 0 1669332 0

As we can see, the performance of SimpleStack is utterly abysmal, taking over 70 seconds to pop 100000 elements! However, for smallish stacks, it is perfectly adequate and performs about as well as the rest of the implementations. The IndexStack is a clear winner in terms of speed, but wastes quite a bit of memory if the stack ever grows large. A separate compacting stage which is called every now and again could deal with this (e.g. compact if less than half the list is being used), for a small amortized cost in performance. That is the solution of CompactingIndexStack. The overhead seems quite high compared to the linked list stack that follows it. The two cons-cell versions perform adequately. The ConsStack is a touch slower than the indexed version, while being able to efficiently shrink the stack as needed. The big problem though is the ease with which a heavily nested data structure can cause Tcl to crash! Be careful not to generate string reps you don't need! The ArConsStack performs adequately too, but suffers a bit of a performance overhead compared to lists

DKF: The problem with SimpleStack is precisely its quadratic performance behaviour (which totally dominates anything relating to reference management).

NEM: Indeed. My initial code that prompted this investigation was using the idiom:

 set xs [lassign $xs x]

as a simple in-place queue, which also suffers from the same quadratic behaviour. Incidentally, the tcllib versions of both stack and queue suffer from this O(N²) implementation (using lreplace instead of lrange).

AK: That will change.

AK: I should further note that the tcllib packages have some methods like peek which are difficult to implement in some of the nicer representations. Incidentially, we have only benchmarks for sets and tree in the struct module. Would be good to put these up for stacks, and derive something for queues as well.

AK (2008-06-24): June 18/19 saw the creation of a benchmark for Tcllib's struct::stack, speedup of the Tcl implementation, and the creation of a C implementation via critcl.