Version 3 of Adding a hashed datatype

Updated 2001-11-07 13:41:22

This page is the result of a comment on the Better Arrays for Tcl9 page -- JCW


This demo illustrates an experimental "hash" command and datatype for Tcl, which uses the usual odd-even representation to store data, plus an internal integer "map" vector to represent a hashing layer placed on top. See http://www.equi4.com/previews/hash.tar.gz for all source code.


This sounds from an API point of view similar to TclX's keyed lists. The mapping of course improves performance.

Yes, but TclX's keyed lists are lists of 2-item lists, while the odd-even lists below are a single list... makes quite a difference in speed and mem-use --jcw


Differences between hash objects and arrays:

  • hash entries are not variables, this model does not support traces
  • order can be preserved, though inserts/deletes will take O(N) time
  • memory use is lower than arrays (2 ints per item, plus free slots)

Performance-wise, the output below seems to indicate the following:

  • access is similar (the 29 vs. 20 uSec is maybe just call overhead)
  • creation from an odd-even list is over 25x faster than for arrays
  • full traversal takes the same amount of time for both
  • full traversal of the odd-even list is obviously faster for hashes
  • set/unset performance has not been measured (not implemented in C)

The current code does not support set/unset yet. Setting would either add or replace, depending on whether the key already exists.

Script:


  puts "version = [package require jolt]"

  if {[info commands nop] != ""} {
    puts "10 nops = [time {nop;nop;nop;nop;nop;nop;nop;nop;nop;nop} 10000]"
  }

  proc hash_try {} {
    set a {1 one 2 two 3 three}
    puts "data = [hash a]"
    puts "length = [hash a length]"
    puts "keys = [hash a keys]"
    puts "values = [hash a values]"
    puts "map = [hash a map]"
    puts "get 1 = [hash a get 1]"
    puts "get 2 = [hash a get 2]"
    puts "get 3 = [hash a get 3]"

    foreach x {1 10 100 1000 10000 100000} {
      set v {}
      for {set y 0} {$y < $x} {incr y} {
        lappend v $y [expr {$y+$y}]
      }
      set n [expr {1000000/$x}]
      if {$n > 100} { set n 100 }
      set r [lindex [time {set a $v; hash a} ${n}0] 0]
      set s [lindex [time {array unset b; array set b $v} $n] 0]
      puts [format "create %6d = %5d hash, vs. %d array" $x $r $s]
    }

    foreach x {0 1000 abc} {
      catch {hash a get $x} e1
      catch {set b($x)} e2
      puts "get $x: $e1 <-> $e2"
    }

    set x 1
    puts "10x h = [time {
        hash a get $x; hash a get $x; hash a get $x; hash a get $x;
        hash a get $x; hash a get $x; hash a get $x; hash a get $x;
                                      hash a get $x; hash a get $x} 10000]"
    puts "10x a = [time {
        set b($x); set b($x); set b($x); set b($x);
        set b($x); set b($x); set b($x); set b($x);
                              set b($x); set b($x)} 10000]"

    puts "h all = [time {foreach x [hash a keys] { hash a get $x }}]"
    puts "a all = [time {foreach x [array names b] { set b($x) }}]"
    puts "same = [expr {[lsort [hash a]] == [lsort [array get b]]}]"
    puts "h list = [time {foreach {x y} [hash a] { }}]"
    puts "a list = [time {foreach {x y} [array get b] { }}]"
  }

Output (SuSE 7.1 Linux, PIII/650):


  version = 2001.11.06.170201
  10 nops = 7 microseconds per iteration

    HASH
    ====

  data = 1 one 2 two 3 three
  length = 3
  keys = 1 2 3
  values = one two three
  map = 0 -1 0 -1 0 -1 0 -1 2105051955 2 -2061914958 4 0 -1 1977051568 0
  get 1 = one
  get 2 = two
  get 3 = three
  create      1 =     5 hash, vs. 12 array
  create     10 =     5 hash, vs. 31 array
  create    100 =     6 hash, vs. 323 array
  create   1000 =    30 hash, vs. 2310 array
  create  10000 =  1370 hash, vs. 37315 array
  create 100000 = 15341 hash, vs. 415132 array
  get 0: 0 <-> 0
  get 1000: 2000 <-> 2000
  get abc: no such key <-> can't read "b(abc)": no such element in array
  10x h = 29 microseconds per iteration
  10x a = 20 microseconds per iteration
  h all = 557448 microseconds per iteration
  a all = 556954 microseconds per iteration
  same = 1
  h list = 216386 microseconds per iteration
  a list = 542398 microseconds per iteration