Performance of Various Stack Implementations

NEM 2008-06-10: Whilst implementing some sample trie code, I happened to use a naive implementation of a queue, and was quite shocked at just how much of a performance penalty I had to pay: code that should have taken less than a second (even milliseconds) took almost a minute to complete! This inspired me to investigate various implementations of simple data structures such as stacks and queues. Here I present a number of different implementations of a basic stack data structure and compare them for performance and memory usage.

Framework

First, here's a little module framework to hide away some of the boilerplate. You can skip this bit: a module is just a namespace ensemble and a method is just a proc with some sugar for upvar.

package require Tcl 8.5
# Simple modules
proc module {name body} {
    namespace eval ::$name $body
    namespace eval ::$name { namespace ensemble create }
}
# Define a constructor command that simply returns its own invocation
# (the command name plus its arguments) as a list; used later for the
# tagged cons cells in ConsStack.
proc cons {name args} {
    uplevel 1 [list proc $name $args { info level 0 }]
}
# A replacement for proc that exports the given command from the module.
# Also supports &foo syntax in parameters for automatically [upvar]ing
# variable references.
proc method {name params body} {
    set cmd "upvar 1"
    foreach p $params {
	set p [lindex $p 0]
	if {[string match &* $p]} {
	    append cmd " \${$p} " [string range $p 1 end]
	}
    }
    set body "$cmd\n$body"
    uplevel 1 [list proc $name $params $body]
    uplevel 1 [list namespace export $name]
}
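
To see how this fits together, here is a tiny throwaway example (the Counter module is invented purely for illustration and is not used below): a module becomes an ensemble command, and a parameter written as &name receives the name of a variable in the caller, which the method then reads and writes directly via the generated upvar.

module Counter {
    method init &c { set c 0 }
    method bump &c { incr c }
}

Counter init n      ;# creates variable n in the caller, set to 0
Counter bump n      ;# the caller's n is now 1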

Implementations

Naive Stack Implementation

First, here is the most basic stack implementation, probably how most people would first try to implement a stack in Tcl. The stack is simply a list: pushing an element does an lappend, and popping does an lindex to get the last element followed by an lrange to truncate the stack list. As we will see, that lrange is an absolute killer for performance.

module SimpleStack {
    method init &s { set s [list] }
    method push {&s elem} {
	lappend s $elem
    }
    method pop &s {
	set r [lindex $s end]
	set s [lrange $s 0 end-1]
	return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}

Lars H, 2008-06-10: Well, there's little need to be quite so naive — a trivial K-combinator style rewrite of the above appears to restore linear complexity:

module SimpleStack2 {
    method init &s { set s [list] }
    method push {&s elem} {
	lappend s $elem
    }
    method pop &s {
	set r [lindex $s end]
	set s [lreplace $s [set s end] end] ; # Make sure [lreplace] operates on unshared object
	return $r
    }
    method empty? &s { expr {[llength $s] == 0} }
}

NEM: Not only does it restore linear complexity, it actually appears to be the fastest implementation here. I should have remembered K, given that it came up in almost exactly the same kind of situation. Looking at the source of Tcl_ListObjReplace, it seems that in the unshared case, deleting a single element from the end is guaranteed to be an O(1) operation, which indeed brings the "pop all" operation back to O(N).
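
For reference, the unsharing trick that SimpleStack2 uses inline is more often written with an explicit K combinator; this is just another spelling of the same idea, not a further implementation:

proc K {x y} { return $x }
# $s is passed to [lreplace] while the variable's own reference is dropped,
# so the list value reaching [lreplace] is unshared and can be truncated
# in place instead of being copied.
set s [lreplace [K $s [set s {}]] end end]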

Maintain an index into a list

A natural adaptation of the above is to avoid the lrange by instead maintaining an index pointer into the stack list. The stack is grown as needed on push, but we never shrink it. Instead we just adjust an index that points to the top of the stack.

module IndexStack {
    # Store index as first element, rest are real elements
    method init &s { set s [list 0] }
    method push {&s elem} {
	set idx [lindex $s 0]
	if {[incr idx] >= [llength $s]} {
	    lappend s $elem
	} else {
	    lset s $idx $elem
	}
	lset s 0 $idx
    }
    method pop &s {
	set idx [lindex $s 0]
	set head [lindex $s $idx]
	incr idx -1
	lset s 0 $idx
	return $head
    }
    method empty? &s {
	expr {[lindex $s 0] == 0}
    }
}
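
To make the representation concrete, here is a short trace; the comments show the value of s after each call (note that pop only moves the index and leaves the old slot in place):

IndexStack init s      ;# s is now "0"
IndexStack push s a    ;# s is now "1 a"
IndexStack push s b    ;# s is now "2 a b"
IndexStack pop s       ;# returns b; s is now "1 a b" (slot kept, index moved)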

Compacting Indexed Stack

Here is a version that compacts the stack list whenever we are using less than half of it. This adds a bit of amortized overhead to each pop call, but can save quite a bit of space for stacks that vary greatly in size.

# A version of IndexStack that also compacts the stack if it grows small
module CompactingIndexStack {
    method init &s { IndexStack init s }
    method push {&s elem} { IndexStack push s $elem }
    method pop &s {
	set r [IndexStack pop s]
	# Check to see if worth compacting
	set len [llength $s]
	if {$len > 10 && [lindex $s 0] < ($len/2.0)} {
	    # Yes...
	    set s [lrange $s 0 [lindex $s 0]]
	}
	return $r
    }
    method empty? &s { IndexStack empty? s }
}
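
A short trace shows the compaction heuristic in action; the numbers below just follow from the thresholds used above (only lists longer than 10 elements, counting the index word, are ever compacted):

CompactingIndexStack init s
for {set i 1} {$i <= 20} {incr i} { CompactingIndexStack push s $i }
puts [llength $s]     ;# 21: the index word plus 20 slots
for {set i 0} {$i < 10} {incr i} { CompactingIndexStack pop s }
puts [llength $s]     ;# 11: truncated once fewer than half the slots were in use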

Nested List (cons-cell) Stack

Another approach would be to use a linked list representation for the stack, as found in many functional programming languages. Here we present two approaches: one using nested lists as cons cells, and one using arrays. First, the nested-list representation:

# An implementation using nested lists as cons cells
module ConsStack {
    cons Nil
    cons Pair a b
	
    method init &s { set s [Nil] }
    method push {&s elem} {
	set s [Pair $elem $s]
    }
    method pop &s {
	switch -exact [lindex $s 0] {
	    Nil     { error "empty" }
	    Pair    { 
		set r [lindex $s 1]
		set s [lindex $s 2]
		return $r
	    }
	}
    }
    method empty? &s { expr {[llength $s] == 1} }
}
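
The stack value here is just a nested Tcl list tagged with the constructor name, so a short trace looks like this:

ConsStack init s       ;# s is now "Nil"
ConsStack push s a     ;# s is now "Pair a Nil"
ConsStack push s b     ;# s is now "Pair b {Pair a Nil}"
ConsStack pop s        ;# returns b; s is back to "Pair a Nil"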

The implementation is almost as clear as the original simple version, but the performance is much better. You have to be careful, though, as Tcl isn't great at handling deeply nested structures. For instance, changing the implementation of empty? to the following quickly blows the C stack, causing a core dump (generating the string rep of a deeply nested list is a bad idea, it seems).

method empty? &s { expr {$s eq [Nil]} }

Arrays as cons-cells

The final version is similar to the previous one, but uses arrays as mutable cons cells. This is closer to a Lisp implementation, but it is cumbersome to create in Tcl and needs some housekeeping code:

# An implementation using arrays as cons cells
module ArConsStack {
    variable id 0
    proc Nil {} { return "" }
    proc Pair {a b} {
	variable id
	set name ::ArConsStack::cell[incr id]
	upvar #0 $name cell
	set cell(h) $a
	set cell(t) $b
	return $name
    }
    proc car t {
	upvar #0 $t cell
	return $cell(h)
    }
    proc cdr t {
	upvar #0 $t cell
	return $cell(t)
    }
    method init &s { set s [Nil] }
    method push {&s elem} {
	set s [Pair $elem $s]
    }
    method pop &s {
	if {$s eq ""} { error "empty" }
	set r [car $s]
	set s2 [cdr $s]
	# Cleanup
	unset $s
	set s $s2
	return $r
    }
    method empty? &s {
	expr {$s eq ""}
    }
}
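
For comparison with the previous version, the stack variable here only ever holds the name of the top cell, each cell living in its own global array until pop unsets it. Roughly (the exact cell numbers depend on how many cells have been allocated before):

ArConsStack init s      ;# s is now ""
ArConsStack push s a    ;# s is now e.g. "::ArConsStack::cell1"; cell1(h)=a, cell1(t)=""
ArConsStack push s b    ;# s is now "::ArConsStack::cell2"; cell2(t)="::ArConsStack::cell1"
ArConsStack pop s       ;# returns b, unsets cell2, and s is "::ArConsStack::cell1" again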

If you have any more clever implementations, feel free to add them here.

Benchmarks

We use the following simple test framework to both unit test and benchmark the implementations. It's pretty basic but gives an idea. Lars H: I hacked test to allow varying the number of repetitions. For one thing, 100000 times took so long that the first time I ran it I thought it was stuck in an infinite loop. Also, you need to vary the number of repetitions to determine the complexity exponent.

proc benchmark {name body} {
    set time [lindex [uplevel 1 [list time $body]] 0]
    puts [format "%-10s : %10d us" $name $time]
}
proc assert {msg expr} {
    if {![uplevel 1 [list expr $expr]]} {
	error "ASSERT FAILED: $msg"
    }
}
proc test {stack {repetitions 100000}} {
    puts "\nTesting $stack $repetitions"
    # Basic unit/sanity checks
    $stack init s
    foreach x {a b c d e} { $stack push s $x }
    foreach x {e d c b a} {
	set y [$stack pop s]
	assert "$x == $y" {$y eq $x}
    }
    assert "stack empty" {[$stack empty? s]}
    unset s
    
    # Benchmark - create a big stack then pop every element at once
    $stack init s
    benchmark push { for {set i 0} {$i < $repetitions} {incr i} { $stack push s $i } }
    benchmark pop  { while {![$stack empty? s]} { $stack pop s } }
    puts "Final size: [string bytelength $s] bytes"
    
    # Now test with push/pops interleaved
    unset s
    benchmark push/pop {
	$stack init s
	for {set i 0} {10*$i < $repetitions} {incr i} {
	    for {set j 0} {$j < 10} {incr j} { $stack push s $j }
	    for {set j 9} {$j >= 0} {incr j -1} {
		set h [$stack pop s]
		assert "$h == $j" {$h == $j}
	    }
	}
    }
    puts "Final size: [string bytelength $s] bytes"
}
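
To follow up on Lars H's point about determining the complexity exponent: if popping N elements takes time roughly proportional to N^k, then timing the pop-all phase at two sizes N and 2N gives k ~ log(T(2N)/T(N))/log(2). A rough sketch of such a measurement (exponent is just a made-up helper name, not part of the framework above):

proc exponent {stack n} {
    set times {}
    foreach reps [list $n [expr {2*$n}]] {
        # Build a stack of $reps elements, then time popping it empty.
        $stack init s
        for {set i 0} {$i < $reps} {incr i} { $stack push s $i }
        lappend times [lindex [time { while {![$stack empty? s]} { $stack pop s } }] 0]
        unset s
    }
    lassign $times t1 t2
    expr {log(double($t2)/$t1) / log(2)}
}
# [exponent SimpleStack 10000] should come out near 2 (quadratic pop),
# while [exponent SimpleStack2 10000] should be close to 1 (linear).

The modules are then exercised with: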
test SimpleStack
test SimpleStack2
test IndexStack
test CompactingIndexStack
test ConsStack
test ArConsStack

Which gives the following results (2GHz Intel Core Duo MacBook):

Testing SimpleStack
push       :     200875 us
pop        :   70945579 us
Final size: 0 bytes
push/pop   :     910583 us
Final size: 0 bytes

Testing SimpleStack2
push       :     201996 us
pop        :     303392 us
Final size: 0 bytes
push/pop   :     887688 us
Final size: 0 bytes

Testing IndexStack
push       :     245049 us
pop        :     272153 us
Final size: 588891 bytes
push/pop   :     914710 us
Final size: 21 bytes

Testing CompactingIndexStack
push       :     428960 us
pop        :     694956 us
Final size: 13 bytes
push/pop   :    1369530 us
Final size: 11 bytes

Testing ConsStack
push       :     346980 us
pop        :     464432 us
Final size: 3 bytes
push/pop   :    1196018 us
Final size: 3 bytes

Testing ArConsStack
push       :     696860 us
pop        :     671755 us
Final size: 0 bytes
push/pop   :    1669332 us
Final size: 0 bytes

Putting everything into a table

Module               |  push (µs) |   pop (µs) | Final size 1 (bytes) | push/pop (µs) | Final size 2 (bytes)
SimpleStack          |     200875 |   70945579 |                    0 |        910583 |                    0
SimpleStack2         |     201996 |     303392 |                    0 |        887688 |                    0
IndexStack           |     245049 |     272153 |               588891 |        914710 |                   21
CompactingIndexStack |     428960 |     694956 |                   13 |       1369530 |                   11
ConsStack            |     346980 |     464432 |                    3 |       1196018 |                    3
ArConsStack          |     696860 |     671755 |                    0 |       1669332 |                    0

As we can see, the performance of SimpleStack is utterly abysmal, taking over 70 seconds to pop 100000 elements! However, for smallish stacks it is perfectly adequate and performs about as well as the rest of the implementations. The IndexStack is a clear winner in terms of speed (although, as noted above, SimpleStack2 edges ahead of it on push and on the interleaved test), but wastes quite a bit of memory if the stack ever grows large. A separate compacting stage, called every now and again, could deal with this (e.g. compact if less than half the list is being used), for a small amortized cost in performance. That is the solution of CompactingIndexStack, although its overhead seems quite high compared to the linked-list stack that follows it. The two cons-cell versions perform adequately. The ConsStack is a touch slower than the indexed version, while being able to shrink the stack efficiently as needed. The big problem, though, is the ease with which a heavily nested data structure can cause Tcl to crash: be careful not to generate string reps you don't need! The ArConsStack also performs adequately, but suffers a bit of a performance overhead compared to the list-based versions.

DKF: The problem with SimpleStack is precisely its quadratic performance behaviour (which totally dominates anything relating to reference management).

NEM: Indeed. My initial code that prompted this investigation was using the idiom:

 set xs [lassign $xs x]

as a simple in-place queue, which also suffers from the same quadratic behaviour. Incidentally, the tcllib versions of both stack and queue suffer from this O(N^2) implementation (using lreplace instead of lrange).

AK: That will change.

AK: I should further note that the tcllib packages have some methods, like peek, which are difficult to implement in some of the nicer representations. Incidentally, we only have benchmarks for sets and tree in the struct module. It would be good to put these up for stacks, and derive something for queues as well.

(DKF: I've reflected the table about a diagonal axis to make it easier to read.)